Bill Inmon posted an article where he
discussed the drawbacks of a "virtual data warehouse." Now the idea of a
virtual data warehouse has been around for years (I remember IBI positioning
their EDASQL technology as a virtual data warehouse fifteen years ago). That
was a terrible idea then because it provided only a method to access data, with
no proper cleansing, integration or persistent storage of it. It was, frankly,
not a data warehouse at all. Using that definition, anyone would agree that a
"virtual data warehouse" was not a workable solution.
The problem is that Inmon defines any method of augmenting the
data warehouse with direct access to source data as a virtual data warehouse.
He describes it in the same terms as the IBI concept from long ago, citing that
concept's shortcomings and casting the virtual data warehouse as not only a bad
idea, but one that has been historically bad and discredited. In fact, the term "virtual
data warehouse" has been used for years by detractors for any idea that
augmented or supplemented the data warehouse architecture, many of which were
truly misguided ideas. In today's world, though, it is not only possible, it is
necessary to supplement the data warehouse. This "virtual data warehouse" is a
new thing, something not possible before.
Inmon's primary argument about virtual data warehousing is
that it skips the crucial step of integrating data:
"The great appeal of the virtual data warehouse is that you
do not have to face the problem of integration."
This is the central premise of his argument, that a virtual data warehouse skips
data integration. What actually happens in
today's version of a virtual data warehouse (and it shouldn't really be called
that) is that a central data warehouse endures, with the kind of integrated data
that Inmon would approve of. However, because there are now so many data sources,
they change more frequently, and the volumes are extreme, the batch-loaded data
warehouse is burdened with latency and is often impractical for 100% of the
analytical requirements. Because system architecture has changed so radically
over 15 years, it is possible to read logs and queues, perform change
data capture and even go directly into the operational
systems without resorting to extraction into staging areas or degrading their
performance. The semantics of these source systems are considerably more
transparent than they used to be, and the speed of processing is so much faster
that it is possible to cleanse and integrate data on the fly. In those cases
where it is not feasible, or where for a myriad of reasons it makes sense to
warehouse the data, that process is preserved. So the virtual data warehouse is
actually more of a surround strategy than
an alternative.
"Why then is the virtual data warehouse such a supremely bad idea? There are
actually lots of reasons for the vacuity of virtue manifested by the virtual
data warehouse."
I won't repeat all of his reasons, but they all seem to be related to the old "it
will bring the system to its knees" claims about the inefficiency of
federated queries and the effect on systems and network performance. I call
this reasoning "managing
from scarcity," except there is no scarcity anymore. Again, not every
query is federated; many will be satisfied by the data warehouse. We can build systems
thousands of times larger and millions of times faster than we could two
decades ago, when data warehousing was catching on and isolation of data and
processing was absolutely necessary. Today's applications and business
processes call for faster, fresher data than a data warehouse can provide, in
many cases. Rather than ignore these requirements, a virtual data warehouse
(with adequate semantic
rationalization) gives the developers a chance to position the processing
at the most logical location, and the ability to use abstraction (or indirection if
you prefer that word) to address the data, providing the flexibility to change
that configuration whenever necessary.
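To make the indirection point concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the DataCatalog class, the "orders" dataset and the two reader functions are stand-ins I made up for illustration, not any product's API. The idea is only that consumers ask for a logical name, and where that name is served from can be rebound without touching them.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Reader = Callable[[], List[Row]]


class DataCatalog:
    """Hypothetical indirection layer: logical dataset name -> current reader."""

    def __init__(self) -> None:
        self._readers: Dict[str, Reader] = {}

    def bind(self, name: str, reader: Reader) -> None:
        # Point a logical name at a physical location (or re-point it later).
        self._readers[name] = reader

    def read(self, name: str) -> List[Row]:
        # Callers never learn where the rows actually came from.
        return self._readers[name]()


def read_from_warehouse() -> List[Row]:
    # Stand-in for integrated, batch-loaded warehouse data.
    return [{"order_id": 1, "status": "shipped", "served_by": "warehouse"}]


def read_from_operational_feed() -> List[Row]:
    # Stand-in for fresher data read from a log, queue or change data capture.
    return [{"order_id": 2, "status": "pending", "served_by": "live"}]


catalog = DataCatalog()

# Initial configuration: "orders" is satisfied by the warehouse.
catalog.bind("orders", read_from_warehouse)
before = catalog.read("orders")

# Later, the same logical name is re-pointed at the fresher source.
catalog.bind("orders", read_from_operational_feed)
after = catalog.read("orders")
```

The calling code is identical before and after the rebinding, which is the flexibility being argued for here. A real framework would also have to carry the semantic rationalization, which this sketch leaves out entirely.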
"The ugly truth is that the corporate analyst MUST do integration whether he or she
wants to or not. There is no getting around it. Wishful thinking just does not
cut it here."
Who wouldn't agree with that? The question is, where and when? The bulk of the
difficult integration work is manual - people gathering information and getting
buy-in that they have the right mapping. Where that process gets implemented is
not the issue. There is no rule that says it must all go to a data warehouse.
Historically, that was the practice because of resource constraints that are
now mostly relaxed.
"So the problems with the virtual data warehouse are legion. There is poor performance
or no performance at all. There is an enormous system overhead. There is only a
limited amount of historical data. There is the work of integration that each
analyst must do, one way or the other. There is no reconcilability of data.
There is no single version of the truth for the corporation."
All of these things are true in the virtual data warehouse the way Inmon defines it, a
big mess of federated queries directly to source systems with no integration,
no history, no real context. No one would suggest this is a good idea. But a
data warehouse today should be a participant in a constellation of source data,
so long as the access methods to data beyond the warehouse's control can be
relied upon to be correct, timely and not a burden on other systems. This kind
of arrangement is not a virtual data warehouse; it's an analytical framework.
We should discuss the merits of the idea as such, not dismiss it because it has
been given the name of a long-discredited concept.