Originally published March 19, 2009
Have you ever been to a fair with the game where a mechanical gopher pops out of a hole and it is your job to whack it when it appears? Once you whack the gopher, it is only a short amount of time before it reemerges from another hole. This is a good game for 4-year-olds, but it is very frustrating for adults.
A virtual data warehouse is like this carnival game. I believe virtual data warehouses are inane. Just when you think this incredibly inane idea has died, and just when someone has delivered what should have been a death blow, out it pops again from another hole. The virtual data warehouse just won’t die, no matter how hard or how many times it gets whacked.
The great appeal of the virtual data warehouse is that you do not have to face the problem of integration. For whatever reason, organizations DREAD integrating their data. Managers lie awake at night in a cold sweat just thinking about the awful work of integrating their corporate data. They would rather put their heads in a moving cement mixer than integrate their data.
So there is definitely an appeal to the virtual data warehouse. It is an easy way out.
(For the uninitiated, a virtual data warehouse is what you get when a single query fans out across many separate databases at run time; in other words, a distributed query. The appeal of a distributed query is that the analyst does not have to face the dreadful task of integrating data.)
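To make the idea concrete, here is a minimal sketch of a distributed query, assuming three in-memory SQLite databases stand in for three separate operational systems. The source names, the table, and the amounts are all hypothetical. The point is only that the same query is sent to every source at run time and the raw rows are pooled as they come back.

```python
import sqlite3

def make_source(rows):
    """Create an in-memory database standing in for one operational system."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?)", [(r,) for r in rows])
    return conn

# Hypothetical sources; in practice these would be separate production databases.
sources = {
    "source_1": make_source([100.0]),
    "source_2": make_source([250.0]),
    "source_3": make_source([75.0]),
}

def distributed_query(sources, sql):
    """Run the same SQL against every source at query time and pool the rows."""
    results = []
    for name, conn in sources.items():
        for row in conn.execute(sql):
            results.append((name, *row))
    return results

rows = distributed_query(sources, "SELECT amount FROM sales")
```

Nothing in this sketch reconciles what the pooled rows mean. Each row comes back exactly as its source stored it.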
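Why then is the virtual data warehouse such a supremely bad idea? There are actually lots of reasons for the vacuity of virtue manifested by the virtual data warehouse. Some of those reasons are poor performance or no performance at all, enormous system overhead, and only a limited amount of historical data.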
But there are deeper reasons why a query that has to access a lot of databases simultaneously has some major architectural problems. Those problems stem from the integration of data. In order to illustrate the problems of the integration of data, suppose there are three applications. Application A is written for Australians. It has Australian dollars and centimeters. Application B is written for Americans. It has American dollars and measurements made in feet and inches. Application C is written for Canadians. It has Canadian dollars and measurements made in millimeters.
So what happens when there is a query that has to access a lot of databases simultaneously? Guess what has to happen every time the query accesses applications A, B, and C? The analyst must integrate the data. If you want the result to have meaning, you cannot add Australian dollars to American dollars to Canadian dollars without taking into account the exchange rates between them. So integration MUST be done before the data can be useful and meaningful.
The same goes for adding together centimeters, feet and inches, and millimeters. You cannot meaningfully add these numbers without accounting for the differences in the units of measurement. And there are literally hundreds of similar conversions that must be made.
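A small sketch shows what that integration looks like for one record from each of applications A, B, and C. The length factors are the standard ones (1 inch is 2.54 centimeters). The exchange rates, the record layout, and the sample amounts are made-up placeholders for illustration, not real quotes or real data.

```python
from dataclasses import dataclass

# Standard length factors, expressed as centimeters per unit.
CM_PER_UNIT = {"cm": 1.0, "mm": 0.1, "in": 2.54, "ft": 30.48}

# Placeholder exchange rates to US dollars; illustrative only, not real quotes.
USD_PER_UNIT = {"USD": 1.0, "AUD": 0.5, "CAD": 0.8}

@dataclass
class Record:
    source: str        # which application the record came from
    amount: float      # money in the source's own currency
    currency: str
    length: float      # measurement in the source's own unit
    length_unit: str

def integrate(record):
    """Return (amount in USD, length in cm) for one source record."""
    usd = record.amount * USD_PER_UNIT[record.currency]
    cm = record.length * CM_PER_UNIT[record.length_unit]
    return usd, cm

records = [
    Record("A", 100.0, "AUD", 50.0, "cm"),
    Record("B", 100.0, "USD", 10.0, "in"),
    Record("C", 100.0, "CAD", 200.0, "mm"),
]

# Adding raw amounts mixes three currencies into one meaningless number.
naive_total = sum(r.amount for r in records)

# Converting first gives totals in a single currency and a single unit.
integrated_total = sum(integrate(r)[0] for r in records)
total_cm = sum(integrate(r)[1] for r in records)
```

With these placeholder rates, the naive sum and the integrated sum differ, and only the second one is in a known currency. In a virtual data warehouse, something like `integrate` has to run inside every query, every time.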
The ugly truth is that the corporate analyst MUST do integration whether he or she wants to or not. There is no getting around it. Wishful thinking just does not cut it here.
And, as if there were not enough obstacles, there is the issue of compatible integration. One analyst integrates the data from A, B, and C and arrives at one conclusion. Another analyst integrates data from A, B, and C and uses a different algorithm for integration. Now the corporation has two sets of values and no way to reconcile them. This problem is exacerbated by the fact that EVERY analyst must devise his or her own integration algorithms, and there is no guarantee that any two analysts use the same formula or algorithm. When every analyst is free to integrate data as he or she sees fit, there is no integrity of data. There is no single corporate system of record and no single corporate version of the truth.
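Here is a hedged sketch of how two perfectly reasonable analysts can diverge. It assumes a handful of hypothetical Australian-dollar sales, each paired with a made-up exchange rate for its sale date, plus a made-up period-end rate. Neither function is "the" correct method; they are simply two defensible choices.

```python
# Hypothetical AUD sales, each with a made-up AUD-to-USD rate for its sale date.
sales_aud = [(1000.0, 0.70), (2000.0, 0.65), (1500.0, 0.68)]

# Made-up closing rate for the reporting period.
PERIOD_END_RATE = 0.66

def analyst_one_total(sales):
    """Convert each sale at the rate in effect on its own date."""
    return sum(amount * rate for amount, rate in sales)

def analyst_two_total(sales, closing_rate):
    """Sum the raw amounts, then convert once at the period-end rate."""
    return sum(amount for amount, _ in sales) * closing_rate

total_one = analyst_one_total(sales_aud)
total_two = analyst_two_total(sales_aud, PERIOD_END_RATE)

# Same source data, two different answers, and nothing that says which is official.
discrepancy = total_one - total_two
```

Under these assumed figures the two totals differ by about 50 US dollars on 4,500 Australian dollars of sales. Multiply that by every analyst and every conversion, and the reconciliation problem becomes clear.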
So the problems with the virtual data warehouse are legion. There is poor performance or no performance at all. There is an enormous system overhead. There is only a limited amount of historical data. There is the work of integration that each analyst must do, one way or the other. There is no reconcilability of data. There is no single version of the truth for the corporation.
The irony is that the corporation's analysts end up integrating its data over and over, rather than once.
So if these problems are not an issue for you and your corporation, go right ahead and build your virtual data warehouse. The person who follows you in your job probably will have a more in-depth understanding of the issues.
Recent articles by Bill Inmon