Blog: Rick van der Lans

Rick van der Lans

Welcome to my blog where I will talk about a variety of topics related to data warehousing, business intelligence, application integration, and database technology. Currently my special interests include data virtualization, NoSQL technology, and service-oriented architectures. If there are any topics you'd like me to address, send them to me at rick@r20.nl.

Rick is an independent consultant, speaker and author, specializing in data warehousing, business intelligence, database technology and data virtualization. He is managing director and founder of R20/Consultancy. An internationally acclaimed speaker who has lectured worldwide for the last 25 years, he is the chairman of the successful European Enterprise Data and Business Intelligence Conference, held annually in London. In the summer of 2012 he published his new book Data Virtualization for Business Intelligence Systems. He is also the author of one of the most successful books on SQL, the popular Introduction to SQL, which is available in English, Chinese, Dutch, Italian and German. He has written many white papers for various software vendors. Rick can be contacted by sending an email to rick@r20.nl.

Editor's Note: Rick's blog and more articles can be accessed through his BeyeNETWORK Expert Channel.

In a series of blogs, I am answering some of the questions a large US-based health care organization had on data virtualization. I decided to share some of their questions with you, because they represent issues that many organizations struggle with.

Their question: "Isn't data virtualization by definition slow, because it's an extra layer of software? And doesn't all the federation, integration, transformation, and cleansing of data that has to take place on demand slow down each query?"

This is a question I can't disregard in this series, because performance is an aspect that always worries people when they hear about data virtualization for the first time. In addition, I received a comment on the first blog in this series, which was related to performance.

There is much to say about the performance of data virtualization servers, but because this is a blog, I will focus on the key issues.

First of all, some think that the performance of a data virtualization server is by definition poor because it accesses source production systems rather than a data warehouse, a data mart, or some other database designed and optimized for reporting. It's true that retrieving data from source systems can lead to performance problems. These systems may not have been designed or optimized to run BI queries, or their transaction workload may be so intense that running queries on them as well can cause serious interference. Therefore, in most cases, this is not the recommended approach. A better approach is to design a data warehouse and let the data virtualization server access that data warehouse instead of the production systems. Data virtualization does not exclude a data warehouse; also see my blog Do We Still Need A Data Warehouse?

Second, because data virtualization evolved from data federation, some think that data virtualization is only worthwhile when data from multiple data sources is retrieved and integrated, that is, when data is federated. And because data federation can be a resource-hungry operation, they conclude that data virtualization must be slow. All data virtualization servers do support data federation, and they have various techniques to optimize the federation process. But data virtualization servers are not only useful when data federation is needed. In many systems, data virtualization is used even when every query is a non-federated query. In that case, the strength of data virtualization is encapsulation (hiding all irrelevant technical details of the data stores) and abstraction (showing only relevant data, with the right structure, and on the right aggregation level).
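
To make encapsulation and abstraction a bit more concrete, here is a minimal sketch in Python. It is not how any particular data virtualization product works; I simply use an SQLite view as a stand-in for a virtual table. The table name T_ORD_0042, its columns, and the view customer_revenue are all made up for this illustration.

```python
import sqlite3

# Hypothetical source system with cryptic, technical names (assumed for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE T_ORD_0042 (ord_id INTEGER, cst_cd TEXT, amt_cts INTEGER, del_flg INTEGER)"
)
conn.executemany(
    "INSERT INTO T_ORD_0042 VALUES (?, ?, ?, ?)",
    [(1, "C1", 1250, 0), (2, "C1", 750, 0), (3, "C2", 4000, 0), (4, "C2", 999, 1)],
)

# The stand-in for a virtual table. It hides the physical names and the
# soft-delete flag (encapsulation) and shows one row per customer with the
# amount converted from cents to whole currency units (abstraction).
conn.execute(
    """
    CREATE VIEW customer_revenue AS
    SELECT cst_cd AS customer, SUM(amt_cts) / 100.0 AS revenue
    FROM T_ORD_0042
    WHERE del_flg = 0
    GROUP BY cst_cd
    """
)

def customer_revenue():
    """Query the virtual table; the caller never sees T_ORD_0042."""
    return conn.execute(
        "SELECT customer, revenue FROM customer_revenue ORDER BY customer"
    ).fetchall()

print(customer_revenue())
```

Note that no federation is involved here: there is only one data source, and the view still adds value for the report developer.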

Third, to make access to data sources as efficient as possible, many optimization techniques are implemented in data virtualization servers. Examples are ship joins, query substitution, query pushdown, and query expansion. These techniques are mature and have proven their worth. In fact, research in this area has been going on since the days of IBM's famous System R* project and the Ingres project, both of which started way back in the 1970s. And research continues: new techniques to optimize data access are still being discovered.
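
As an illustration of one of these techniques, query pushdown, consider the following sketch in Python. It is a simplified example under my own assumptions, not the optimizer of a real data virtualization server: an in-memory SQLite database plays the role of the data source, and the sales table and both function names are hypothetical. The point is only that pushing the filter down to the source reduces the number of rows that have to travel to the virtualization layer.

```python
import sqlite3

# Hypothetical data source: 1,000 sales rows, of which every tenth is "EU".
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
source.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("EU" if i % 10 == 0 else "US", i) for i in range(1000)],
)

def without_pushdown(region):
    """Fetch every row and filter in the virtualization layer."""
    rows = source.execute("SELECT region, amount FROM sales").fetchall()
    matching = [row for row in rows if row[0] == region]
    return matching, len(rows)  # result, rows transferred from the source

def with_pushdown(region):
    """Push the filter down so the source returns only matching rows."""
    rows = source.execute(
        "SELECT region, amount FROM sales WHERE region = ?", (region,)
    ).fetchall()
    return rows, len(rows)  # result, rows transferred from the source

slow_result, slow_transferred = without_pushdown("EU")
fast_result, fast_transferred = with_pushdown("EU")
print(slow_transferred, fast_transferred)
```

Both functions return the same rows, but the second one lets the source do the filtering, which is usually what you want when the source is a database server that is good at exactly that.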

Another example of a technique that improves performance is caching. With this technique, the contents of virtual tables (the key building blocks of data virtualization servers) are stored in a cache. When such a virtual table is accessed, the result is retrieved from the cache rather than from the underlying data sources. The effect is that the data sources don't have to be accessed, no data transformation has to take place, and no data has to be integrated or cleansed. The data is ready to go. It's like picking up a pizza from a take-away restaurant where you phoned in your order 30 minutes beforehand.
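
The idea can be sketched in a few lines of Python. This is a toy example under my own assumptions, not the caching mechanism of any specific product: the class CachedVirtualTable, the function expensive_load, and the refresh rule (reload after a fixed number of seconds) are all hypothetical. Real servers offer their own refresh options.

```python
import time

class CachedVirtualTable:
    """Hypothetical virtual table that keeps its last result in a cache."""

    def __init__(self, load_from_sources, ttl_seconds, clock=time.monotonic):
        self._load = load_from_sources   # does the access, transformation, cleansing
        self._ttl = ttl_seconds          # assumed refresh rule: time to live
        self._clock = clock
        self._cached_rows = None
        self._loaded_at = None

    def rows(self):
        now = self._clock()
        expired = self._loaded_at is None or now - self._loaded_at >= self._ttl
        if expired:
            # Only here are the underlying data sources touched.
            self._cached_rows = self._load()
            self._loaded_at = now
        return self._cached_rows

stats = {"source_calls": 0}

def expensive_load():
    """Stand-in for retrieving, integrating, and cleansing source data."""
    stats["source_calls"] += 1
    return [("C1", 20.0), ("C2", 40.0)]

fake_now = [0.0]  # controllable clock so the example does not have to sleep
table = CachedVirtualTable(expensive_load, ttl_seconds=60, clock=lambda: fake_now[0])

table.rows()
table.rows()
calls_within_ttl = stats["source_calls"]    # second access was served from the cache

fake_now[0] = 61.0
table.rows()
calls_after_expiry = stats["source_calls"]  # cache expired, sources accessed again
print(calls_within_ttl, calls_after_expiry)
```

The trade-off is the usual one with caches: the longer the cached contents are kept, the less the data sources are bothered, but the older the data that the reports show.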

More and more data virtualization servers offer all kinds of features to store and access caches efficiently. For example, some allow the cache to be stored in the fastest analytical database servers available. It's to be expected that in the near future, data virtualization servers will also support in-memory database servers for storing caches. Undeniably, this will speed up query processing even more, because accessing those caches will involve no disk I/O.

To summarize, some have worries about the performance of data virtualization servers, but quite regularly those worries are based on wrong assumptions, such as the assumption that data virtualization excludes a data warehouse. Data virtualization servers offer enough optimization techniques to process the majority of today's queries fast. However, if you want to run queries that join historical data stored in a data warehouse with data coming from a sentiment analysis executed on textual information straight from Twitter, and join that with production data that still has to be cleansed heavily, then yes, you will have a performance challenge.

Note: For more information on data virtualization topics such as query optimization and caching, I refer you to my new book "Data Virtualization for Business Intelligence Systems", available from Amazon.


Posted October 15, 2012 1:24 AM