

Clearly Defining Data Virtualization, Data Federation, and Data Integration

Originally published December 16, 2010

The terms data virtualization, data federation, and data integration are used more and more often. Unfortunately, they have never been defined properly. As can be expected, this leads to confusing discussions, misuse of the terms, vendors using the terms the way it benefits them, and so on. Some regard them as synonyms, others see them as overlapping concepts, and there are those who prefer to see them as opposites. Barry Devlin also referred to this discussion in his recent blog published at BeyeNetwork.com: Virtualization, Federation, EII and other non-synonyms.

It looks as if everyone assigns their own personal meaning to these terms. This meaning is probably based on personal background, experience with certain products, and on how he or she interprets the words virtualization and federation.

Wikipedia is not helping us either with its definition: "Data virtualization is a method of data integration and is often referred to as data federation, enterprise information integration (EII) or data services." This would imply that data virtualization and data federation are the same.

All this confusion is, as we all understand, not very productive. We need clear definitions. This article, therefore, proposes definitions for these three related terms. I am interested in hearing your reaction, so if you have any comments, please let me know. Let's see if, together, we can come up with generally accepted definitions.

Data Virtualization

Virtualization is not a new concept in the IT industry. It all started years ago when virtual memory was introduced in the 1960s using a technique called paging. Memory virtualization was used to simulate more memory than was physically available in a machine. Nowadays, almost everything can be virtualized, including processors, storage, networks, and operating systems. In general, virtualization means that applications can use a resource without concern for where it resides, what the technical interface is, how it has been implemented, the platform it uses, how large it is, and how much of it is available.

Based on the definitions of those other forms of virtualization, we propose the following definition for data virtualization:

Data virtualization is the process of offering data consumers a data access interface that hides the technical aspects of stored data, such as location, storage structure, API, access language, and storage technology.

Data virtualization provides an abstraction layer that data consumers can use to access data in a consistent manner. A data consumer can be any application retrieving or manipulating data, such as a reporting or data entry application. This abstraction layer hides all the technical aspects of data storage. The applications don't have to know where all the data has been stored physically, where the database servers run, what the source API and database language are, and so on.
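To make the abstraction-layer idea concrete, here is a minimal sketch in Python. All names (such as DataConsumerView) are hypothetical, and the two backing stores are deliberately trivial: one logical dataset happens to live in a SQLite database, the other in a CSV file, but the consumer receives identically shaped rows and never learns the location, storage structure, API, or storage technology:

```python
import csv
import io
import sqlite3

class DataConsumerView:
    """Hypothetical abstraction layer: consumers ask for rows by logical
    dataset name and never see where or how the data is physically stored."""

    def __init__(self):
        # One source is a SQLite database, the other a CSV text file;
        # neither detail is visible through the public interface.
        self._db = sqlite3.connect(":memory:")
        self._db.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
        self._db.execute("INSERT INTO customers VALUES (1, 'Alice')")
        self._csv = "id,name\n2,Bob\n"

    def rows(self, dataset):
        # Every dataset is returned in the same consistent shape.
        if dataset == "customers_db":
            return [dict(id=r[0], name=r[1])
                    for r in self._db.execute("SELECT id, name FROM customers")]
        if dataset == "customers_file":
            return [dict(id=int(r["id"]), name=r["name"])
                    for r in csv.DictReader(io.StringIO(self._csv))]
        raise KeyError(dataset)

view = DataConsumerView()
print(view.rows("customers_db"))    # same row structure...
print(view.rows("customers_file"))  # ...regardless of storage technology
```

The consumer's code is identical for both datasets; swapping SQLite for a remote database server or a cloud API would change only the internals of the view, which is exactly the transparency the definition asks for.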

Technically, data virtualization can be implemented in many different ways. Here are a few examples:
  • With a federation server, multiple data stores can be made to look like one. The applications will see one large data store, while in fact the data is stored in several separate data stores. More on data federation below.

  • An enterprise service bus (ESB) can be used to develop a layer of services that allow access to data. The applications invoking those services will not know where the data is stored, what the original source interface is, how the data is stored, and what its storage structure is. They will only see, for example, a SOAP or REST interface. In this case, the ESB is the abstraction layer.

  • Placing data stores in the cloud is also a form of data virtualization. To access a data store, the applications see the cloud API, but they have no idea where the data itself resides. Whether the data is stored and managed locally or remotely is transparent.

  • In a way, building up a virtual database in memory with data loaded from data stored in physical databases can also be regarded as data virtualization. The storage structure, API, and location of the real data is transparent to the application accessing the in-memory database. In the business intelligence (BI) industry, this is now referred to as in-memory analytics.

  • Organizations could also develop their own software-based abstraction layer that hides where and how the data is stored.

Data Federation

In most cases, if the term federation is used, it refers to combining autonomously operating objects. For example, states can be federated to form one country. If we apply this common explanation to data federation, it means combining autonomous data stores to form one large data store. Therefore, we propose the following definition:

Data federation is a form of data virtualization where the data stored in a heterogeneous set of autonomous data stores is made accessible to data consumers as one integrated data store by using on-demand data integration.

This definition is based on the following concepts:
  • Data virtualization: Data federation is a form of data virtualization. Note that not all forms of data virtualization imply data federation. For example, if an organization wants to virtualize the database of one application, there is no need for data federation. But data federation always results in data virtualization.

  • Heterogeneous set of data stores: Data federation should make it possible to bring data together from data stores using different storage structures, different access languages, and different APIs. An application using data federation should be able to access different types of database servers and files with various formats; it should be able to integrate data from all those data sources; it should offer features for transforming the data; and it should allow the applications and tools to access the data through various APIs and languages.

  • Autonomous data stores: Data stores accessed by data federation are able to operate independently; in other words, they can be used outside the scope of data federation.

  • One integrated data store: Regardless of how and where data is stored, it should be presented as one integrated data set. This implies that data federation involves transformation, cleansing, and possibly even enrichment of data.

  • On-demand integration: This refers to when the data from a heterogeneous set of data stores is integrated. With data federation, integration takes place on the fly, not in batch. Only when data consumers ask for data is it accessed and integrated. The data is therefore not stored in an integrated form, but remains in its original location and format.
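A minimal sketch of on-demand integration, under stated assumptions: the two autonomous stores below are hypothetical (a SQLite orders database, and a plain dictionary standing in for a second, non-relational customer source). Each store can be used on its own, and the join across them happens only at query time; no integrated copy is ever persisted:

```python
import sqlite3

# Two autonomous stores: each can also be used outside the federation.
orders_db = sqlite3.connect(":memory:")
orders_db.execute("CREATE TABLE orders (cust_id INTEGER, amount REAL)")
orders_db.executemany("INSERT INTO orders VALUES (?, ?)",
                      [(1, 10.0), (1, 5.0), (2, 7.5)])

# Stand-in for a heterogeneous second source (e.g. a file or NoSQL store).
customers = {1: "Alice", 2: "Bob"}

def federated_totals():
    """On-demand integration: the cross-store join runs when the consumer
    asks for data; the result is computed on the fly, never stored."""
    totals = {}
    for cust_id, amount in orders_db.execute(
            "SELECT cust_id, amount FROM orders"):
        name = customers[cust_id]  # lookup in the other autonomous store
        totals[name] = totals.get(name, 0.0) + amount
    return totals

print(federated_totals())  # {'Alice': 15.0, 'Bob': 7.5}
```

The consumer sees one integrated result set, while the orders stay in SQLite and the customer names stay in their own source, each in its original format.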

Data Integration

The third term we want to define is data integration. According to SearchCRM, integration (from the Latin word integer, meaning whole or entire) generally means combining parts so that they work together or form a whole. If data from different data sources is brought together, we talk about data integration:

Data integration is the process of combining data from a heterogeneous set of data stores to create one unified view of all that data.

Data integration involves joining data, transforming data values, enriching data, and cleansing data values. What this definition of data integration doesn't enforce is how the integration takes place. For example, the original data could be copied from its source data stores, transformed and cleansed, and subsequently stored in another data store. This is the approach taken when using ETL tools. Another option is to integrate the data live; for example, a federation server would do most of the integration work on demand. Yet another approach is to modify the source data stores themselves so that the data is already transformed and cleansed, leaving almost no transformation and cleansing to be done when the data is brought together.
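The ETL-style option mentioned above can be sketched as follows (the source, target, and table names are hypothetical): data is extracted from the source store, transformed and cleansed in between, and then physically stored in a separate, integrated target store:

```python
import sqlite3

# Source store with raw, uncleansed values (note casing and whitespace).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE raw_customers (name TEXT)")
source.executemany("INSERT INTO raw_customers VALUES (?)",
                   [("  alice ",), ("BOB",), ("  alice ",)])

# Target store that will hold the integrated, cleansed copy.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE customers (name TEXT UNIQUE)")

def etl_run():
    """ETL style: extract from the source, transform and cleanse in
    between, and load the result into another (physical) data store."""
    for (name,) in source.execute("SELECT name FROM raw_customers"):
        cleaned = name.strip().title()  # transform + cleanse
        # UNIQUE plus INSERT OR IGNORE deduplicates during the load step.
        target.execute("INSERT OR IGNORE INTO customers VALUES (?)",
                       (cleaned,))

etl_run()
print([r[0] for r in
       target.execute("SELECT name FROM customers ORDER BY name")])
# ['Alice', 'Bob']
```

Contrast this with the federated style: here the integrated result is materialized in the target store in batch, whereas a federation server would produce the same unified view on demand and store nothing.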

A term that is used in relationship to the three above is enterprise information integration (EII). I have one remark on this term. There is an essential difference between data and information. Data is what is stored and processed in our systems; users determine whether the data they receive is information or not. The conclusion: we don't integrate information, we integrate data, which could lead to information. Therefore, the term should have been enterprise data integration. That said, EII is a synonym for data integration.

We summarize with a few closing remarks. Data virtualization does not always require data integration; that depends on the number of data sources being accessed. Data federation always requires data integration. And from the perspective of data integration, data federation is just one style of integration.

Hopefully, these definitions are acceptable to most of you, and as indicated, I appreciate any comments to improve them.

  • Rick van der Lans

    Rick is an independent consultant, speaker and author, specializing in data warehousing, business intelligence, database technology and data virtualization. He is managing director and founder of R20/Consultancy. An internationally acclaimed speaker who has lectured worldwide for the last 25 years, he is the chairman of the successful annual European Enterprise Data and Business Intelligence Conference held in London. In the summer of 2012 he published his new book Data Virtualization for Business Intelligence Systems. He is also the author of one of the most successful books on SQL, the popular Introduction to SQL, which is available in English, Chinese, Dutch, Italian and German. He has written many white papers for various software vendors. Rick can be contacted by sending an email to rick@r20.nl.

    Editor's Note: Rick's blog and more articles can be accessed through his BeyeNETWORK Expert Channel.


Comments


Posted August 26, 2011 by Ash Parikh

Rick,

Here is a recent article by Linda Briggs on data virtualization and how to address data integration issues with "true" data virtualization:

http://tdwi.org/articles/2011/06/15/virtualization-and-data-integration-issues.aspx

The best practices for enabling agile data integration with "true" data virtualization are showcased here:

http://tdwi.org/Articles/2011/06/15/Virtualization-and-Data-Integration-Issues.aspx?Page=4

Informatica Data Services, Informatica's data virtualization solution is an example of "true" data virtualization. More information on the best practices discussed are available at:

http://www.informatica.com/Pages/data_virtualization_index.aspx

http://www.informatica.com/products_services/data_services/Pages/index.aspx

Thanks,


Ash


Posted June 14, 2011 by Ash Parikh aparikh@informatica.com

Rick,

Informatica recently released the latest version of its data virtualization solution, Informatica Data Services version 9.1, as part of the Informatica 9.1 Platform.

http://www.informatica.com/products_services/data_services/Pages/index.aspx

Key highlights for this release are:

  • The capability to dynamically mask federated data while it is in flight, without processing or staging, just like what we were doing with the full palette of data quality and complex ETL-like data transformations before. This is helping end users leverage a rich set of data transformation, data quality, and data masking capabilities in real time.
  • The ability for business users (analysts) to play a bigger role in the Agile Data Integration Process, and work closely with IT users (architects and developers), using role-based tools. This is helping in accelerating the data integration process, with self-service capabilities.
  • The ability to instantly reuse data services for any application, whether it is a BI tool, composite application, or portal, without re-deploying or re-building the data integration logic. This is done graphically in a metadata-driven environment, increasing agility and productivity.

Here is a demo and chalk talk:

Thanks,

Ash Parikh


Posted April 27, 2011 by Ash Parikh

Here is the article written by Rick on Informatica Data Services - Informatica's data virtualization technology.

http://www.b-eye-network.com/blogs/vanderlans/archives/2011/04/data_virtualiza.php

A key aspect that needs ongoing clarification is that simple, traditional data federation is not data virtualization. Data virtualization is all about hiding and handling complexity. Dealing with enterprise data is a complex proposition that calls for rich transformations, not limiting the user to SQL-only or XQuery-only transformations. Additionally, simple, traditional data federation assumes that the data in the backend data stores is ready for consumption - that it is of good quality - which is not the case. You cannot just propagate bad data in real time and then lose the advantage gained by deferring cleansing to post-processing. You need to perform these complex transformations, including data quality, on the fly, on the federated data.

More information on Informatica Data Services can be found here:

http://www.informatica.com/products_services/data_services/Pages/index.aspx

Rick covers this in great detail in this technical whitepaper:

http://vip.informatica.com/?elqPURLPage=6011&docid=1571&lsc=NA-Ongoing-2011Q1-JP-DI_Developing_Data_Delivery_Platform_WP_www

Thanks,


Ash Parikh
