The terms data virtualization, data federation, and data integration are used more and more often. Unfortunately, these terms have never been defined properly. As can be expected, this leads to confusing discussions, misuse of the terms, vendors using the terms in whatever way benefits them, and so on. Some regard them as synonyms, others see them as overlapping concepts, and there are those who prefer to see them as opposites. Barry Devlin also referred to this discussion in his recent blog published at BeyeNetwork.com: Virtualization, Federation, EII and other non-synonyms.
It looks as if everyone assigns their own personal meaning to these terms. This meaning is probably based on personal background, experience with certain products, and on how he or she interprets the words virtualization and federation.
Wikipedia is not helping us either with their definition: "Data virtualization is a method of data integration and is often referred to as data federation, enterprise information integration (EII) or data services." This would imply that data virtualization and data federation are the same.
All this confusion is, as we all understand, not very productive. We need clear definitions. This article, therefore, proposes definitions for these three related terms. I am interested in hearing your reaction, so if you have any comments, please let me know. Let's see if, together, we can come up with generally accepted definitions.
Virtualization is not a new concept in the IT industry. It started in the 1960s, when virtual memory was introduced using a technique called paging. Memory virtualization was used to simulate more memory than was physically available in a machine. Nowadays, almost everything can be virtualized, including processors, storage, networks, and operating systems. In general, virtualization means that applications can use a resource without concern for where it resides, what the technical interface is, how it has been implemented, the platform it uses, how large it is, and how much of it is available.
Based on the definitions of those other forms of virtualization, we propose the following definition for data virtualization:
Data virtualization is the process of offering data consumers a data access interface that hides the technical aspects of stored data, such as location, storage structure, API, access language, and storage technology.
Data virtualization provides an abstraction layer that data consumers can use to access data in a consistent manner. A data consumer can be any application retrieving or manipulating data, such as a reporting or data entry application. This abstraction layer hides all the technical aspects of data storage. The applications don't have to know where the data is physically stored, where the database servers run, what the source API and database language are, and so on.
Technically, data virtualization can be implemented in many different ways. Here are a few examples:
- With a federation server, multiple data stores can be made to look like one. The applications will see one large data store, while in fact the data is stored in several data stores. More on data federation next.
- An enterprise service bus (ESB) can be used to develop a layer of services that allow access to data. The applications invoking those services will not know where the data is stored, what the original source interface is, how the data is stored, and what its storage structure is. They will only see, for example, a SOAP or REST interface. In this case, the ESB is the abstraction layer.
- Placing data stores in the cloud is also a form of data virtualization. To access a data store, the applications will see the cloud API, but they have no idea where the data itself resides. Whether the data is stored and managed locally or remotely is transparent.
- In a way, building up a virtual database in memory with data loaded from physical databases can also be regarded as data virtualization. The storage structure, API, and location of the real data are transparent to the application accessing the in-memory database. In the business intelligence (BI) industry, this is now referred to as in-memory analytics.
- Organizations could also develop their own software-based abstraction layer that hides where and how the data is stored.
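To make the idea of a home-grown abstraction layer concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the interface `CustomerSource`, the two storage classes, and the `report_line` consumer are hypothetical names invented for this example, and the two "data stores" are an in-memory SQLite database and a JSON string.

```python
import json
import sqlite3
from typing import Optional, Protocol


class CustomerSource(Protocol):
    """The only interface a data consumer sees (hypothetical name)."""

    def get_customer(self, customer_id: int) -> Optional[dict]: ...


class SqliteCustomerSource:
    """Customers kept in a relational table and accessed with SQL."""

    def __init__(self) -> None:
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute(
            "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)"
        )
        self._conn.execute("INSERT INTO customer VALUES (1, 'Acme')")

    def get_customer(self, customer_id: int) -> Optional[dict]:
        row = self._conn.execute(
            "SELECT id, name FROM customer WHERE id = ?", (customer_id,)
        ).fetchone()
        return {"id": row[0], "name": row[1]} if row else None


class JsonCustomerSource:
    """The same customers kept as a JSON document and accessed by parsing."""

    def __init__(self) -> None:
        self._doc = json.loads('[{"id": 1, "name": "Acme"}]')

    def get_customer(self, customer_id: int) -> Optional[dict]:
        return next((c for c in self._doc if c["id"] == customer_id), None)


def report_line(source: CustomerSource, customer_id: int) -> str:
    """A data consumer: it cannot tell which storage technology is behind the interface."""
    customer = source.get_customer(customer_id)
    return f"Customer {customer_id}: {customer['name'] if customer else 'unknown'}"
```

The consumer `report_line` produces the same result whichever source it is given. Location, storage structure, and access language stay hidden behind `get_customer`.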
In most cases, if the term federation is used, it refers to combining autonomously operating objects. For example, states can be federated to form one country. If we apply this common explanation to data federation, it means combining autonomous data stores to form one large data store. Therefore, we propose the following definition:
Data federation is a form of data virtualization where the data stored in a heterogeneous set of autonomous data stores is made accessible to data consumers as one integrated data store by using on-demand data integration.
This definition is based on the following concepts:
- Data virtualization: Data federation is a form of data virtualization. Note that not all forms of data virtualization imply data federation. For example, if an organization wants to virtualize the database of one application, no need exists for data federation. But data federation always results in data virtualization.
- Heterogeneous set of data stores: Data federation should make it possible to bring data together from data stores using different storage structures, different access languages, and different APIs. An application using data federation should be able to access different types of database servers and files with various formats; it should be able to integrate data from all those data sources; it should offer features for transforming the data; and it should allow the applications and tools to access the data through various APIs and languages.
- Autonomous data stores: Data stores accessed by data federation are able to operate independently; in other words, they can be used outside the scope of data federation.
- One integrated data store: Regardless of how and where data is stored, it should be presented as one integrated data set. This implies that data federation involves transformation, cleansing, and possibly even enrichment of data.
- On-demand integration: This refers to when the data from a heterogeneous set of data stores is integrated. With data federation, integration takes place on the fly, and not in batch. Only when the data consumers ask for data is the data accessed and integrated. So the data is not stored in an integrated way, but remains in its original location and format.
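The concepts in this definition can be sketched in a few lines of Python. This is a toy illustration, not a federation server: the table `customer`, the CSV content in `ORDERS_CSV`, and the function `orders_per_customer` are all assumptions made up for this example. One store is a relational database accessed with SQL, the other is a CSV file (simulated with a string), and the two are only combined at the moment a consumer asks for the result.

```python
import csv
import io
import sqlite3
from typing import Dict, List

# Autonomous store 1: a relational database, accessed with SQL.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
crm.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])

# Autonomous store 2: a CSV file, simulated here with an in-memory string.
ORDERS_CSV = "customer_id,amount\n1,100.50\n1,20.00\n2,75.25\n"


def orders_per_customer() -> List[dict]:
    """Read both stores at request time and return one integrated result."""
    totals: Dict[int, float] = {}
    for row in csv.DictReader(io.StringIO(ORDERS_CSV)):
        # Transformation: CSV values arrive as text and are converted here.
        cid = int(row["customer_id"])
        totals[cid] = totals.get(cid, 0.0) + float(row["amount"])

    result = []
    for cid, name in crm.execute("SELECT id, name FROM customer ORDER BY id"):
        # Integration: join the SQL rows with the CSV totals on customer id.
        result.append({"customer": name, "total": totals.get(cid, 0.0)})
    return result
```

Nothing is copied or stored in integrated form. Both stores remain usable on their own, and a change in either one is visible the next time `orders_per_customer` is called.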
The third term we want to define is data integration. According to SearchCRM, integration (from the Latin word integer, meaning whole or entire) generally means combining parts so that they work together or form a whole. If data from different data sources is brought together, we talk about data integration:
Data integration is the process of combining data from a heterogeneous set of data stores to create one unified view of all that data.
Data integration involves joining data, transforming data values, enriching data, and cleansing data values. What this definition of data integration doesn't prescribe is how the integration takes place. For example, original data could be copied from its source data stores, transformed and cleansed, and subsequently stored in another data store. This is the approach taken when using ETL tools. Alternatively, the integration could take place live. For example, a federation server would do most of the integration work on demand. A third approach is to modify the source data stores themselves so that the data in them is already transformed and cleansed. Almost no transformation or cleansing is then required when the data is brought together.
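The difference between the batch style and the live style can be shown with a small Python sketch. Everything in it is hypothetical and chosen only for illustration: the two in-memory SQLite databases `source` and `warehouse`, the trivial `cleanse` rule, and the function names. It is not how any particular ETL tool or federation server works.

```python
import sqlite3

# The source data store, containing an uncleansed value.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE customer (name TEXT)")
source.execute("INSERT INTO customer VALUES ('  acme ')")

# A separate target data store that holds the integrated copy.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE customer_clean (name TEXT)")


def cleanse(name: str) -> str:
    """A stand-in cleansing rule: trim whitespace and normalize capitalization."""
    return name.strip().title()


def run_etl() -> None:
    """Batch style: extract, transform, and store a copy in another data store."""
    rows = source.execute("SELECT name FROM customer").fetchall()
    warehouse.execute("DELETE FROM customer_clean")
    warehouse.executemany(
        "INSERT INTO customer_clean VALUES (?)", [(cleanse(n),) for (n,) in rows]
    )


def read_warehouse() -> list:
    """Consumers of the batch style read the stored copy."""
    query = "SELECT name FROM customer_clean ORDER BY name"
    return [n for (n,) in warehouse.execute(query)]


def read_on_demand() -> list:
    """Live style: nothing is copied; cleansing happens when data is requested."""
    return sorted(cleanse(n) for (n,) in source.execute("SELECT name FROM customer"))
```

Both functions return the same unified view, but `read_warehouse` only reflects the source as it was at the last `run_etl`, while `read_on_demand` reflects the source as it is at the moment of the request.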
A term that is used in relation to the three above is enterprise information integration (EII). I have one remark on this term. There is an essential difference between data and information. Data is what is stored and processed in our systems. Users determine whether the data they receive is information or not. In conclusion, we don't integrate information; we integrate data, which could lead to information. Therefore, the term should have been enterprise data integration. That said, EII is a synonym for data integration.
We summarize with a few closing remarks. Data virtualization does not always need data integration; that depends on the number of data sources being accessed. Data federation always requires data integration. Seen from the side of data integration, data federation is just one style of integrating data.
Hopefully, these definitions are acceptable to most of you, and as indicated, I appreciate any comments to improve them.
Recent articles by Rick van der Lans