Clearly Defining Data Virtualization, Data Federation, and Data Integration

Originally published December 16, 2010

The terms data virtualization, data federation, and data integration are used more and more often. Unfortunately, these terms have never been defined properly. As can be expected, this leads to confusing discussions, misuse of the terms, vendors using the terms in whatever way benefits them, and so on. Some regard them as synonyms, others see them as overlapping concepts, and there are those who prefer to see them as opposites. Barry Devlin also referred to this discussion in his recent blog published at BeyeNetwork.com: Virtualization, Federation, EII and other non-synonyms.

It looks as if everyone assigns their own personal meaning to these terms. This meaning is probably based on personal background, experience with certain products, and on how each person interprets the words virtualization and federation.

Wikipedia is not helping us either with its definition: "Data virtualization is a method of data integration and is often referred to as data federation, enterprise information integration (EII) or data services." This would imply that data virtualization and data federation are the same.

All this confusion is, as we all understand, not very productive. We need clear definitions. This article, therefore, proposes definitions for these three related terms. I am interested in hearing your reaction, so if you have any comments, please let me know. Let's see if, together, we can come up with generally accepted definitions.

Data Virtualization

Virtualization is not a new concept in the IT industry. It started in the 1960s, when virtual memory was introduced using a technique called paging. Memory virtualization was used to simulate more memory than was physically available in a machine. Nowadays, almost everything can be virtualized, including processors, storage, networks, and operating systems. In general, virtualization means that applications can use a resource without concern for where it resides, what the technical interface is, how it has been implemented, the platform it uses, how large it is, and how much of it is available.

Based on the definitions of those other forms of virtualization, we propose the following definition for data virtualization:

Data virtualization is the process of offering data consumers a data access interface that hides the technical aspects of stored data, such as location, storage structure, API, access language, and storage technology.

Data virtualization provides an abstraction layer that data consumers can use to access data in a consistent manner. A data consumer can be any application retrieving or manipulating data, such as a reporting or data entry application. This abstraction layer hides all the technical aspects of data storage. The applications don't have to know where the data has been stored physically, where the database servers run, what the source API and database language are, and so on.

Technically, data virtualization can be implemented in many different ways. Here are a few examples:
  • With a federation server, multiple data stores can be made to look like one. The applications will see one large data store, while in fact the data is stored in several data stores. More on data federation next.

  • An enterprise service bus (ESB) can be used to develop a layer of services that allow access to data. The applications invoking those services will not know where the data is stored, what the original source interface is, how the data is stored, and what its storage structure is. They will only see, for example, a SOAP or REST interface. In this case, the ESB is the abstraction layer.

  • Placing data stores in the cloud is also a form of data virtualization. To access a data store, the applications will see the cloud API, but they have no idea where the data itself resides. Whether the data is stored and managed locally or remotely is transparent.

  • In a way, building up a virtual database in memory with data loaded from physical databases can also be regarded as data virtualization. The storage structure, API, and location of the real data are transparent to the application accessing the in-memory database. In the business intelligence (BI) industry, this is now referred to as in-memory analytics.

  • Organizations could also develop their own software-based abstraction layer that hides where and how the data is stored.
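
The last option, a home-grown abstraction layer, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `CustomerSource`, `SqlCustomerSource`, `CsvCustomerSource`, and `customer_name` are hypothetical and do not come from any product, and the two stores are in-memory stand-ins for a real database server and a real file.

```python
import csv
import io
import sqlite3
from typing import Optional, Protocol


class CustomerSource(Protocol):
    """Hypothetical data access interface seen by data consumers."""

    def get_customer(self, customer_id: int) -> Optional[dict]: ...


class SqlCustomerSource:
    """Customers stored in a relational database (in-memory SQLite here)."""

    def __init__(self) -> None:
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
        self._conn.execute("INSERT INTO customer VALUES (1, 'Alice')")

    def get_customer(self, customer_id: int) -> Optional[dict]:
        row = self._conn.execute(
            "SELECT id, name FROM customer WHERE id = ?", (customer_id,)
        ).fetchone()
        return {"id": row[0], "name": row[1]} if row else None


class CsvCustomerSource:
    """Customers stored in a delimited file (an in-memory string here)."""

    def __init__(self, text: str) -> None:
        self._rows = list(csv.DictReader(io.StringIO(text)))

    def get_customer(self, customer_id: int) -> Optional[dict]:
        for row in self._rows:
            if int(row["id"]) == customer_id:
                return {"id": int(row["id"]), "name": row["name"]}
        return None


def customer_name(source: CustomerSource, customer_id: int) -> str:
    """A data consumer: it depends only on the interface, not on the storage."""
    customer = source.get_customer(customer_id)
    return customer["name"] if customer else "unknown"
```

The consumer function never sees SQL, file formats, or locations. Swapping `SqlCustomerSource()` for `CsvCustomerSource("id,name\n2,Bob\n")` changes where and how the data is stored without changing the consumer.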

Data Federation

In most cases, if the term federation is used, it refers to combining autonomously operating objects. For example, states can be federated to form one country. If we apply this common explanation to data federation, it means combining autonomous data stores to form one large data store. Therefore, we propose the following definition:

Data federation is a form of data virtualization where the data stored in a heterogeneous set of autonomous data stores is made accessible to data consumers as one integrated data store by using on-demand data integration.

This definition is based on the following concepts:
  • Data virtualization: Data federation is a form of data virtualization. Note that not all forms of data virtualization imply data federation. For example, if an organization wants to virtualize the database of one application, no need exists for data federation. But data federation always results in data virtualization.

  • Heterogeneous set of data stores: Data federation should make it possible to bring data together from data stores using different storage structures, different access languages, and different APIs. An application using data federation should be able to access different types of database servers and files with various formats; it should be able to integrate data from all those data sources; it should offer features for transforming the data; and it should allow the applications and tools to access the data through various APIs and languages.

  • Autonomous data stores: Data stores accessed by data federation are able to operate independently; in other words, they can be used outside the scope of data federation.

  • One integrated data store: Regardless of how and where data is stored, it should be presented as one integrated data set. This implies that data federation involves transformation, cleansing, and possibly even enrichment of data.

  • On-demand integration: This refers to when the data from a heterogeneous set of data stores is integrated. With data federation, integration takes place on the fly, and not in batch. Only when the data consumers ask for data is the data accessed and integrated. So the data is not stored in an integrated way, but remains in its original location and format.
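
The concepts above can be illustrated with a small Python sketch. It is not how any federation server is implemented; it only assumes two hypothetical autonomous stores (an in-memory SQLite database named `orders_db` and a delimited text named `CUSTOMERS_CSV`) and a hypothetical function `federated_orders` that integrates them each time it is called.

```python
import csv
import io
import sqlite3

# Autonomous store 1: a relational database holding orders.
orders_db = sqlite3.connect(":memory:")
orders_db.execute(
    "CREATE TABLE orders (order_id INTEGER, cust_id INTEGER, amount_cents INTEGER)"
)
orders_db.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)", [(10, 1, 2500), (11, 2, 990)]
)

# Autonomous store 2: a delimited file holding customers (different format and API).
CUSTOMERS_CSV = "id,name\n1, alice \n2,BOB\n"


def federated_orders() -> list:
    """Integrate both stores at query time; nothing is copied or persisted."""
    customers = {
        int(row["id"]): row["name"].strip().title()  # cleanse names on the fly
        for row in csv.DictReader(io.StringIO(CUSTOMERS_CSV))
    }
    rows = orders_db.execute(
        "SELECT order_id, cust_id, amount_cents FROM orders ORDER BY order_id"
    ).fetchall()
    return [
        {
            "order_id": order_id,
            "customer": customers.get(cust_id, "unknown"),
            "amount": cents / 100,  # transform cents into a decimal amount
        }
        for order_id, cust_id, cents in rows
    ]
```

Because the integration happens on demand, an order inserted into `orders_db` after one call shows up in the next call to `federated_orders()`, while both stores remain usable on their own.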

Data Integration

The third term we want to define is data integration. According to SearchCRM, integration (from the Latin word integer, meaning whole or entire) generally means combining parts so that they work together or form a whole. If data from different data sources is brought together, we talk about data integration:

Data integration is the process of combining data from a heterogeneous set of data stores to create one unified view of all that data.

Data integration involves joining data, transforming data values, enriching data, and cleansing data values. What this definition of data integration doesn't enforce is how the integration takes place. For example, it could be that original data is copied from its source data stores, transformed and cleansed, and subsequently stored in another data store. This is the approach taken when using ETL tools. Another solution would be if the integration takes place live. For example, a federation server would do most of the integration work on demand. Another approach is that the source data stores are modified in such a way that data is transformed and cleansed. It's like changing the sources themselves in such a way that almost no transformations and cleansing are required anymore when data is brought together.
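
To make the first style concrete, here is a minimal Python sketch of the copy, transform, and store approach. It is an assumption-laden illustration rather than a description of any ETL tool: the connections `source` and `target`, the tables `customer` and `customer_clean`, and the function `run_etl` are all hypothetical.

```python
import sqlite3

# Hypothetical operational source store with inconsistent values.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE customer (id INTEGER, name TEXT, country TEXT)")
source.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, " alice ", "nl"), (2, "BOB", "NL"), (3, None, "us")],
)

# Hypothetical target store that holds the integrated copy.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE customer_clean (id INTEGER, name TEXT, country TEXT)")


def run_etl() -> int:
    """Extract, transform and cleanse, then load; returns the number of rows loaded."""
    rows = source.execute("SELECT id, name, country FROM customer").fetchall()
    cleaned = [
        (cust_id, name.strip().title(), country.upper())  # transform values
        for cust_id, name, country in rows
        if name is not None  # cleanse: drop rows without a name
    ]
    target.execute("DELETE FROM customer_clean")  # full refresh, for simplicity
    target.executemany("INSERT INTO customer_clean VALUES (?, ?, ?)", cleaned)
    target.commit()
    return len(cleaned)
```

Unlike on-demand integration, the target only reflects the sources as of the last run: a row added to `source` afterwards does not appear in `customer_clean` until `run_etl()` is executed again.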

A term that is used in relationship to the three above is enterprise information integration (EII). I have one remark on this term. There is an essential difference between data and information. Data is what is stored and processed in our systems. Users determine whether the data they receive is information or not. In conclusion, we don't integrate information; we integrate data, which could lead to information. Therefore, the term should have been enterprise data integration. That said, EII is a synonym for data integration.

We summarize with a few closing remarks. Data virtualization might not need data integration. It depends on the number of data sources being accessed. Data federation always requires data integration. For data integration, data federation is just one style of integrating data.

Hopefully, these definitions are acceptable to most of you, and as indicated, I appreciate any comments to improve them.

SOURCE: Clearly Defining Data Virtualization, Data Federation, and Data Integration

  • Rick van der Lans

    Rick is an independent consultant, speaker and author, specializing in data warehousing, business intelligence, database technology and data virtualization. He is managing director and founder of R20/Consultancy. An internationally acclaimed speaker who has lectured worldwide for the last 25 years, he is the chairman of the successful European Enterprise Data and Business Intelligence Conference held annually in London. In the summer of 2012 he published his new book Data Virtualization for Business Intelligence Systems. He is also the author of one of the most successful books on SQL, the popular Introduction to SQL, which is available in English, Chinese, Dutch, Italian and German. He has written many white papers for various software vendors. Rick can be contacted by sending an email to rick@r20.nl.

    Editor's Note: Rick's blog and more articles can be accessed through his BeyeNETWORK Expert Channel.

Comments

Posted November 30, 2011 by Ash Parikh

Rick,

Here is a link to some recent information on Informatica's data virtualization technology:

http://www.informatica.com/us/products/data-virtualization/

http://www.informatica.com/us/products/data-virtualization/data-services/

http://vip.informatica.com/?elqPURLPage=8668

Some of my thoughts are captured in this TDWI article:

http://tdwi.org/articles/2011/10/04/informatica-data-virtualization-power.aspx

And almost 1800 architects are sharing their insights and views on data virtualization on the Data Virtualization and Data Services Architecture group on LinkedIn:

http://www.linkedin.com/groups/Data-Virtualization-Data-Services-Architecture-2934783?mostPopular=&gid=2934783

Regards,

Ash

Posted August 26, 2011 by Ash Parikh

Rick,

Here is a recent article by Linda Briggs on data virtualization and how to address data integration issues with "true" data virtualization:

http://tdwi.org/articles/2011/06/15/virtualization-and-data-integration-issues.aspx

The best practices for enabling agile data integration with "true" data virtualization are showcased here:

http://tdwi.org/Articles/2011/06/15/Virtualization-and-Data-Integration-Issues.aspx?Page=4

Informatica Data Services, Informatica's data virtualization solution, is an example of "true" data virtualization. More information on the best practices discussed is available at:

http://www.informatica.com/Pages/data_virtualization_index.aspx

http://www.informatica.com/products_services/data_services/Pages/index.aspx

Thanks,

Ash

Posted June 14, 2011 by Ash Parikh aparikh@informatica.com

Rick,

Informatica recently released the latest version of its data virtualization solution, Informatica Data Services version 9.1, as part of the Informatica 9.1 Platform.

http://www.informatica.com/products_services/data_services/Pages/index.aspx

Key highlights for this release are:

  • The capability to dynamically mask federated data while it is in flight, without processing or staging, just like what we were doing with the full palette of data quality and complex ETL-like data transformations before. This is helping end users leverage a rich set of data transformations, data quality, and data masking capabilities in real-time.
  • The ability for business users (analysts) to play a bigger role in the Agile Data Integration Process, and work closely with IT users (architects and developers), using role-based tools. This is helping in accelerating the data integration process, with self-service capabilities.
  • The ability to instantly reuse data services for any application, whether it is a BI tool or composite application or portal, without redeploying or rebuilding the data integration logic. This is done graphically in a metadata-driven environment, increasing agility and productivity.

Here is a demo and chalk talk:

Thanks,

Ash Parikh

Posted April 27, 2011 by Ash Parikh

Here is the article written by Rick on Informatica Data Services - Informatica's data virtualization technology.

http://www.b-eye-network.com/blogs/vanderlans/archives/2011/04/data_virtualiza.php

A key aspect that needs on-going clarification is that simple, traditional data federation is not data virtualization. Data virtualization is all about hiding and handling complexity. Dealing with enterprise data is a complex proposition which calls for rich transformations and not limiting the user to SQL-only or XQuery-only transformations. Additionally, simple, traditional data federation assumes that the data in the backend data stores is ready for consumption, that it is of good quality, which is not the case. You cannot just propagate bad data in real-time and then lose the advantage gained to post-processing. You need to do these complex transformations, including data quality, on the fly, on the federated data.

More information on Informatica Data Services can be found here:

http://www.informatica.com/products_services/data_services/Pages/index.aspx

Rick covers this in great detail in this technical whitepaper:

http://vip.informatica.com/?elqPURLPage=6011&docid=1571&lsc=NA-Ongoing-2011Q1-JP-DI_Developing_Data_Delivery_Platform_WP_www

Thanks,

Ash Parikh

Posted January 7, 2011 by

Hello Rick,

Thanks for your reply.

Yes, your answer brings it more in line, but as a result it leaves me with another question. If data virtualization is a possible implementation of data federation and if the term data federation does not address physical storage, then we probably are juggling terms on a level that is not essential. IMO essential characteristics of a solution dubbed with term X (be that Data Federation, Data Virtualization or the new term that probably will hit us this year!) should only address the following:

1) physical storage yes or no

2) integration of data from multiple sources yes or no

3) on-demand integration or pre-integration

Maybe we should also add publish/subscribe capabilities to the big three above. But first we should reach agreement on what the relevant characteristics of any integration solution should be; after that, suppliers should make clear whether their solution Y entails these aspects. For you and I know that no supplier will comply with any definition of X we come up with here.

Posted January 7, 2011 by Rick van der Lans rick@r20.nl

Hi Ron,

Yes, the idea of federation is to make all the data stores look like one integrated data store. However, a data store doesn't have to be physical. And the one created by a federation probably isn't, it's probably virtual. Does that make it more in line with your thoughts?

 

Posted January 5, 2011 by Ron Tijhaar

Does data federation necessarily involve an integrated data store as you suggest?

Of course you can define data federation that way, and maybe it is for the better, since our industry is littered with vague terms popping up everywhere anytime. But in my experience, when we say things like "we've federated our data domains" we primarily address a governance aspect, not an implementation aspect. We mean to say that business ownership over data is not completely centralized. Whether or not this involves a central data store is quite another matter. It may be a consequence of the need for integrated data or it may not. We may implement a federation server or we can go for complete data virtualization. In short, data federation IMO is a view on data governance whereas data virtualization and integrated data stores are implementation patterns to fulfill integration needs.

Posted December 16, 2010 by Anonymous

Rick -

You have done the IT industry and users a good service by providing clarity around these important terms. 

This work, along with your earlier Data Delivery Platform definitions, will help IT strategists, information architects, and Integration Competency Center leaders better understand how to position and deploy these powerful and now well-proven capabilities at their organizations.

Robert Eve, EVP Composite Software
