We use cookies and other similar technologies (Cookies) to enhance your experience and to provide you with relevant content and ads. By using our website, you are agreeing to the use of Cookies. You can change your settings at any time. Cookie Policy.

The Elusive Virtual Data Warehouse

Originally published March 19, 2009

Have you ever been to a fair with the game where a mechanical gopher pops out of a hole and it is your job to whack it when it appears? Once you whack the gopher, it is only a short amount of time before it reemerges from another hole. This is a good game for 4-year-olds, but it is very frustrating for adults.

A virtual data warehouse is like this carnival game. I believe virtual data warehouses are inane. Just when you think this incredibly inane idea has died and just when someone has delivered it what should have been a deathly blow, out it pops again from another hole. The virtual data warehouse just won’t die, no matter how hard or how many times it gets whacked.

The great appeal of the virtual data warehouse is that you do not have to face the problem of integration. For whatever reason, organizations DREAD integrating their data. Managers lay awake at night in a cold sweat just thinking about the awful work of integrating their corporate data. They would rather put their head in a moving cement mixer than integrate their data.

So there is definitely an appeal to the virtual data warehouse. It is an easy way out.

(For the uninitiated, a virtual data warehouse occurs when a query runs around to a lot of databases and does a distributed query. With a distributed query, the analyst does not have to face the dreadful task of integrating data.)

Why then is the virtual data warehouse such a supremely bad idea? There are actually lots of reasons for the vacuity of virtue manifested by the virtual data warehouse. Some of those reasons are:

  • A query that has to access a lot of databases simultaneously uses a lot of system resources. In the best of circumstances, query performance is a real problem.

  • A query that has to access a lot of databases simultaneously requires resources every time it is executed. If the query is run many times at all, the system overhead is very steep.

  • A query that has to access a lot of databases simultaneously is stopped dead in its tracks when it runs across a database that is down or otherwise unavailable.

  • A query that has to access a lot of databases simultaneously shuffles a lot of data around the system that otherwise would not need to be moved. The impact on the network can become very burdensome.

  • A query that has to access a lot of databases simultaneously is limited to the data found in the databases. If there is only a limited amount of historical data in the databases, the query is limited to whatever historical data is found there. For a variety of reasons, many application databases do not have much historical data to begin with.

But there are deeper reasons why a query that has to access a lot of databases simultaneously has some major architectural problems. Those problems stem from the integration of data. In order to illustrate the problems of the integration of data, suppose there are three applications. Application A is written for Australians. It has Australian dollars and centimeters. Application B is written for Americans. It has American dollars and measurements made in feet and inches. Application C is written for Canadians. It has Canadian dollars and measurements made in millimeters.

So what happens when there is a query that has to access a lot of databases simultaneously? Guess what the query has to do every time it accesses application A, B, and C? The analyst must integrate the data. If you want to have meaning, you cannot add Australian dollars to American dollars to Canadian dollars without taking into account their fiscal differences. So integration MUST be done before the data can be useful and meaningful.

The same goes with adding together centimeters, inches and millimeters. You cannot meaningfully add these numbers without accounting for the differences in the measurement of the values. And there are literally hundreds of like conversions that must be made.

The ugly truth is that the corporate analyst MUST do integration whether he or she wants to or not. There is no getting around it. Wishful thinking just does not cut it here.

And – as if there were not enough obstacles – there is the issue of compatible integration. One analyst integrates the data from A, B, and C and arrives at one conclusion. Another analyst integrates data from A, B and C and uses a different algorithm for integration. Now the corporation has two sets of values and no way to reconcile these values. This problem is exacerbated by the fact that EVERY analyst must make his or her own integration algorithms, and there is no guarantee that any two analysts use the same formula or algorithm. When every analyst is free to integrate data as he or she sees fit, then there is no integrity of data. There is no single corporate system of record – there is no corporate single version of the truth.

So the problems with the virtual data warehouse are legion. There is poor performance or no performance at all. There is an enormous system overhead. There is only a limited amount of historical data. There is the work of integration that each analyst must do, one way or the other. There is no reconcilability of data. There is no single version of the truth for the corporation.

The irony is that the analysts in the corporation must integrate its data repeatedly, rather than once.

So if these problems are not an issue for you and your corporation, go right ahead and build your virtual data warehouse. The person that follows you in your job probably will have a more in-depth understanding of the issues.

  • Bill InmonBill Inmon

    Bill is universally recognized as the father of the data warehouse. He has more than 36 years of database technology management experience and data warehouse design expertise. He has published more than 40 books and 1,000 articles on data warehousing and data management, and his books have been translated into nine languages. He is known globally for his data warehouse development seminars and has been a keynote speaker for many major computing associations.

    Editor's Note: More articles, resources and events are available in Bill's BeyeNETWORK Expert Channel. Be sure to visit today!

Recent articles by Bill Inmon



Want to post a comment? Login or become a member today!

Posted March 30, 2009 by Andrew Powell

Bill tells it like it is and I agree that it is a carnival game minus the cheap prizes and cotton candy.

Is this comment inappropriate? Click here to flag this comment.

Posted March 26, 2009 by Joseph Subits

On the subject of virtual data warehouses, my jury is still out because as hardware speed, memory, network and other related infrastructure continues to increase in capability while continuing to come down in cost, many of the traditional "physical" barriers to the virtual data warehouse will dissappear.  That said, I applaud Bill on emphasizing or re-emphasizing the great Achilles heel of virtual data warehouses or traditional data warehouses for that matter.  Any company that does not solve their data integration, data quality and data management issues will ultimately fail to leverage the power of data warehousing, business intelligence, enterprise application integration, e-commerce...etc. in any significant form that it is used.  It appears that the ultimate challenge of the 21st century is to banish the concept of GIGO once and forever.  That said, we must always remember that many of the information silos that exist in companies and why there are so many spreadsheets and so many arguments over who has the "right" data is that information is power and human beings unfortunately don't share power willingly or easily.  Thus, on the whole, we tend to culturally resist "one version of the facts".   

Is this comment inappropriate? Click here to flag this comment.

Posted March 24, 2009 by Neil Raden nraden@hiredbrains.com

This article defines virtual data warehouse in a way that no one in their right mind would support. However, supplementing the data warehouse with other sources that are not loaded into it is a reasonable appraoch these days. I discuss is this article and some alternatives in my blog at http://www.b-eye-network.com/blogs/raden/archives/2009/03/time_to_rexamine_the_virtual_data_warehouse.php

-Neil Raden

Is this comment inappropriate? Click here to flag this comment.

Posted March 20, 2009 by Robert Eve

Virtual Data Warehouses do not make sense.  But there are many cases where Virtual Data Marts and Virtualized Federated Query do, often complementing and extending the value of the Physical Data Warehouse.  This article seems to tar everything virtual with the same brush. 

Of course there are limitations to federated query.  But building and supporting Enterprise Data Warehouse must have a few as well.  Otherwise everybody would have them and they would all be working great.  And we all know that isn't true.  So should we tar all EDWs as well?

Because there are lots of data silos and complexity, diverse needs, various degrees of maturity, and many other factors in the mix, lets face the fact that the world is not so BLACK and WHITE.  And a palette of data integration options is in fact a good thing.


Is this comment inappropriate? Click here to flag this comment.