Clearly Defining Data Virtualization, Data Federation, and Data Integration
Originally published December 16, 2010
More and more often the terms data virtualization, data federation, and data integration are used. Unfortunately, these terms have never been defined properly. And, as can be expected, this leads to confusing discussions, a misuse of the terms, vendors using the terms the way it benefits them, and so on. Some regard them as synonyms, others see them as overlapping concepts, and there are those who prefer the see them as opposites. Barry Devlin also referred to this discussion in his recent blog published at BeyeNetwork.com: Virtualization, Federation, EII and other non-synonyms†.
Virtualization is not a new concept in the IT industry. It all started years ago when virtual memory was introduced in the 1960s using a technique called paging. Memory virtualization was used to simulate more memory than was physically available in a machine. Nowadays, almost everything can be virtualized, including processors, storage, networks, and operating systems. In general, virtualization means that applications can use a resource without concern for where it resides, what the technical interface is, how it has been implemented, the platform it uses, how large it is, and how much of it is available.
Based on the definitions of those other forms of virtualization, we propose the following definition for data virtualization:
Data virtualization is the process of offering data consumers a data access interface that hides the technical aspects of stored data, such as location, storage structure, API, access language, and storage technology.
Data virtualization provides an abstraction layer that data consumers can use to access data in a consistent manner. A data consumer can be any application retrieving or manipulating data, such as a reporting or data entry application. This abstraction layer hides all the technical aspects of data storage. The applications donít have to know where all the data has been stored physically, where the database servers run, what the source API and database language is, and so on.
Technically, data virtualization can be implemented in many different ways. Here are a few examples:Data Federation
In most cases, if the term federation is used, it refers to combining autonomously operating objects. For example, states can be federated to form one country. If we apply this common explanation to data federation, it means combining autonomous data stores to form one large data store. Therefore, we propose the following definition:
Data federation is a form of data virtualization where the data stored in a heterogeneous set of autonomous data stores is made accessible to data consumers as one integrated data store by using on-demand data integration.
This definition is based on the following concepts:Data Integration
The third term we want to define is data integration. According to SearchCRM, integration (from the Latin word integer, meaning whole or entire) generally means combining parts so that they work together or form a whole. If data from different data sources is brought together, we talk about data integration:
Data integration is the process of combining data from a heterogeneous set of data stores to create one unified view of all that data.
Data integration involves joining data, transforming data values, enriching data, and cleansing data values. What this definition of data integration doesnít enforce is how the integration takes place. For example, it could be that original data is copied from its source data stores, transformed and cleansed, and subsequently stored in another data store. This is the approach taken when using ETL tools. Another solution would be if the integration takes place live. For example, a federation server would do most of the integration work on demand. Another approach is that the source data stores are modified in such a way that data is transformed and cleansed. Itís like changing the sources themselves in such a way that almost no transformations and cleansing are required anymore when data is brought together.
A term that is used in relationship to the three above is enterprise information integration (EII). I have one remark on this term. There is an essential difference between data and information. Data is what is stored and processed in our systems. Users determine whether the data they receive is information or not. Conclusion, we donít integrate information, we integrate data, which could lead to information. Therefore, the term should have been enterprise data integration. That said, EII is a synonym for data integration.
We summarize with a few closing remarks. Data virtualization might not need data integration. It depends on the number of data sources being accessed. Data federation always requires data integration. For data integration, data federation is just one style of integrating data.
Hopefully, these definitions are acceptable to most of you, and as indicated, I appreciate any comments to improve them.
Recent articles by Rick van der Lans
Copyright 2004 — 2017. Powell Media, LLC. All rights reserved.
BeyeNETWORK™ is a trademark of Powell Media, LLC