VLDB (very large databases) and VLDW (very large data warehousing) are two industry terms that cause quite a stir. The terms have been changed, altered, redefined, and brought back to the table many times by many people. There are many problems associated with implementing "big systems" and not very many solutions (although vendors are trying). There are also some major business questions around these large data sets and how they are applied.
In this entry I will explore the business questions and the technical challenges faced by big systems. I will attempt to hold my opinion and see what the responses are: what do you think are the issues faced by your business?
First, as always, let's level-set the terms by defining what we mean by "big systems".
VLDB - A large database, with large amounts of information being loaded by a trickle feed, and large amounts of information being queried 24x7x365 (always up). This creates a mixed-workload environment. An example system might be a telephone switch data-capture system hit by quality control and financial analysts looking to see where they are losing and making money NOW (all current information). Typically sized in the range of 50TB to 150TB of operational-type data.
VLDW - A large database, inclusive of history (making it a data warehouse) at a granular level. Typically loaded at intervals anywhere between 3 minutes and 24 hours, with queries against large amounts of history mixed in with queries that are "wide" but not "deep": mixed workload, 24x7x365, detailed data set, raw data set. An example might be all the history of the telephone switching systems mentioned above, so the analysts can determine over time which switches/hosting facilities have the most problems, which bandwidth is frequently overloaded, and what the patterns of overload actually are. Typically sized in the range of 150TB to well over 800TB of historical information (that is ACCESSED).
I'm not discussing systems where "I have 800TB, but it's all on storage, and we load weekly..." - no, that's not what I'm talking about.
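To make the two definitions concrete, here is a minimal sketch that encodes them as a rule of thumb. The class and field names (`BigSystem`, `size_tb`, `load_interval_minutes`, and so on) are hypothetical, and the size ranges are the "typical" figures quoted above treated as hard thresholds purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class BigSystem:
    """Hypothetical description of a system; all field names are illustrative."""
    size_tb: float                # data that is actually accessed, in TB
    keeps_history: bool           # True if granular history is retained
    always_up: bool               # queried 24x7x365
    load_interval_minutes: float  # 0 means a continuous trickle feed


def classify(s: BigSystem) -> str:
    """Label a system VLDB, VLDW, or neither, using the definitions above."""
    if not s.always_up:
        return "neither"
    if s.keeps_history:
        # VLDW: loaded somewhere between every 3 minutes and every 24 hours.
        loads_often_enough = 3 <= s.load_interval_minutes <= 24 * 60
        if s.size_tb >= 150 and loads_often_enough:
            return "VLDW"
        return "neither"
    # VLDB: trickle-fed operational data, typically 50TB to 150TB.
    if 50 <= s.size_tb <= 150 and s.load_interval_minutes == 0:
        return "VLDB"
    return "neither"
```

Under these assumptions, an 800TB system with history that only loads weekly comes back as "neither", which is the kind of system I am excluding.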
The business questions that stir up controversy include: (remember, I'm going to hold my opinionated answers until later)
1) Do we really NEED all this data? What does it buy the business? What can be learned from this?
2) What could possibly be hidden in 800TB that the business users access?
3) What tactical questions are answered by having raw data (transactions) loaded to the VLDW?
4) Why can't the operational system (VLDB) serve as the system of record?
5) What does the VLDW have that the VLDB doesn't? Why do I need to justify the existence of both?
6) How do I mitigate risk of failure of either system?
7) Do I need replication technology instead of "backup" technology for fail-over and recoverability?
8) Is there a SINGLE RDBMS engine that will answer these questions AND scale beyond?
9) Do I need to scale beyond 300/400/800TB? What will that buy me?
And the technical questions:
1) How do I manage backups and restores of this much information?
2) Is data modeling really necessary?
3) Why can't I cluster my machines together? How come I need MPP or big-iron SMP to make this work?
4) How do I get the DBMS to handle mixed-workload queries?
5) Why does the system "go down" when I fire up massive loads WHILE querying?
6) Why do vendors continue to push TPC-H performance when that isn't my "real-world"?
7) What's the difference in systems at 300TB and systems at 800TB?
8) What changes to my architecture/network/OS do I need to make to accommodate this scale?
9) Why can't the users get along with "LESS DATA"? Do they really use all of this?
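To give a feel for the scale behind the backup and restore question (technical question 1), here is a rough back-of-the-envelope sketch. The function name `full_copy_days` is hypothetical, and the sustained throughput is an assumed input you would replace with your own measured number; it is not a figure from any particular vendor or system.

```python
SECONDS_PER_DAY = 86_400


def full_copy_days(size_tb: float, throughput_gb_per_s: float) -> float:
    """Days needed to stream size_tb once at a sustained throughput.

    Uses decimal units (1 TB = 1000 GB) and ignores compression,
    incremental backups, and parallel streams.
    """
    seconds = size_tb * 1000 / throughput_gb_per_s
    return seconds / SECONDS_PER_DAY


# Assumed sustained rate of 1 GB/s, purely for illustration.
for size_tb in (300, 400, 800):
    print(f"{size_tb}TB: {full_copy_days(size_tb, 1.0):.1f} days")
```

At the assumed 1 GB/s, a single full copy of 800TB takes a little over nine days, which puts the replication versus backup question above in context.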
Love to hear your thoughts,
Posted September 25, 2008 12:15 PM