Blog: Dan E. Linstedt

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, Unstructured Data, DW and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University where I participate on an academic advisory board for Masters Students in I.T. I can't wait to hear from you in the comments of my blog entries. Thank-you, and all the best; Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor of The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI / CMMI Level 5, and is the inventor of The Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, trained hundreds of industry professionals, and is the voice of Bill Inmon's blog on http://www.b-eye-network.com/blogs/linstedt/.

VLDB (very large databases) and VLDW (very large data warehousing) are two different terms in the industry that cause quite a stir. The terms have been changed, altered, re-defined, and brought back to the table many times by many people. There are many problems associated with implementing "big systems" and not very many solutions (although vendors are trying). There are also some major business questions around these large data sets and how they are applied.

In this entry I will explore the business questions and the technical challenges faced by big systems. I will attempt to withhold my opinions and see what the responses are. What issues do you think your business faces?

First, as always, let's level-set the terms by defining what we mean by "big systems".

VLDB - A large database, with large amounts of information being loaded by a trickle feed, and large amounts of information being queried 24x7x365 (always up). This creates a mixed workload environment. An example system might be a telephone switch data capture system hit by quality control and financial analysts looking to see where they are losing and making money NOW (all current information). Typically sized in the range of 50TB to 150TB of operational type data.

VLDW - A large database, inclusive of history (making it a data warehouse) at a granular level. Typically loaded anywhere between 3-minute intervals and 24-hour intervals, with queries against large amounts of history, mixed in with queries that are "wide" but not "deep" - mixed workload, 24x7x365, detailed data set, raw data set. An example might be all the history of the telephone switching systems mentioned above, so the analysts can determine over time which switches/hosting facilities have the most problems, which bandwidth is frequently overloaded, and what the patterns of overload actually are. Typically sized in the range of 150TB to well over 800TB of historical information (that is ACCESSED).

I'm not discussing systems where "I have 800TB, but it's all on storage, and we load weekly..." - no, that's not what I'm talking about.
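To make the two definitions concrete, here is a minimal sketch that sorts a system into VLDB, VLDW, or neither. The class name `SystemProfile`, its fields, and the function `classify` are hypothetical names made up for illustration. The thresholds are simply the "typical" ranges quoted above treated as hard cutoffs, which is an assumption; real systems will blur these lines.

```python
from dataclasses import dataclass

MINUTES_PER_DAY = 24 * 60


@dataclass
class SystemProfile:
    """Hypothetical description of a system; all fields are illustrative."""
    size_tb: float                # data volume in terabytes
    keeps_history: bool           # True if granular history is retained
    load_interval_minutes: float  # 0 means a continuous trickle feed
    always_up: bool               # queried 24x7x365
    actively_queried: bool        # data is ACCESSED, not parked on storage


def classify(profile: SystemProfile) -> str:
    """Apply the typical ranges from the definitions as hard cutoffs (an assumption)."""
    # "800TB on storage, loaded weekly" is explicitly excluded.
    if not (profile.always_up and profile.actively_queried):
        return "neither"
    if profile.keeps_history:
        # VLDW: loaded between 3-minute and 24-hour intervals, 150TB and up.
        in_window = 3 <= profile.load_interval_minutes <= MINUTES_PER_DAY
        if profile.size_tb >= 150 and in_window:
            return "VLDW"
        return "neither"
    # VLDB: trickle-fed operational data, typically 50TB to 150TB.
    if 50 <= profile.size_tb <= 150 and profile.load_interval_minutes == 0:
        return "VLDB"
    return "neither"


if __name__ == "__main__":
    switch_capture = SystemProfile(100, False, 0, True, True)
    switch_history = SystemProfile(800, True, 60, True, True)
    parked_archive = SystemProfile(800, True, 7 * MINUTES_PER_DAY, True, False)
    print(classify(switch_capture))   # VLDB
    print(classify(switch_history))   # VLDW
    print(classify(parked_archive))   # neither
```

The third case is the "it's all on storage, and we load weekly" system: big, but not what this entry is about.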

The business questions that are under controversy include: (remember, I'm going to hold my opinionated answers until later)
1) Do we really NEED all this data? What does it buy the business? What can be learned from this?

2) What could possibly be hidden in 800TB that the business users access?

3) What tactical questions are answered by having raw data (transactions) loaded to the VLDW?

4) Why can't the operational system (VLDB) serve as the system of record?

5) What does the VLDW have that the VLDB doesn't? Why do I need to justify the existence of both?

6) How do I mitigate risk of failure of either system?

7) Do I need replication technology instead of "backup" technology for fail-over and recoverability?

8) Is there a SINGLE RDBMS engine that will answer these questions AND scale beyond?

9) Do I need to scale beyond 300/400/800TB? What will that buy me?

And the technical questions:
1) How do I manage backups and restores of this much information?
2) Is data modeling really necessary?
3) Why can't I cluster my machines together? Why do I need MPP or Big-Iron SMP to make this work?
4) How do I get the DBMS to handle mixed-workload queries?
5) Why does the system "go down" when I fire up massive loads WHILE querying?
6) Why do vendors continue to push TPC-H performance when that isn't my "real-world"?
7) What's the difference in systems at 300TB and systems at 800TB?
8) What changes to my architecture/network/OS do I need to make to accommodate this scale?
9) Why can't the users get along with "LESS DATA"? Do they really use all of this?
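To give technical question 1 some scale, here is a back-of-envelope sketch of how long one full copy of the data takes at a given sustained throughput. The function name `full_copy_hours` and the 1 GB/s figure are assumptions chosen for illustration, not measurements from any real system, and decimal units are assumed (1 TB = 10^12 bytes, 1 GB = 10^9 bytes).

```python
BYTES_PER_TB = 10**12  # decimal terabyte (assumption)
BYTES_PER_GB = 10**9   # decimal gigabyte (assumption)
SECONDS_PER_HOUR = 3600


def full_copy_hours(size_tb: float, throughput_gb_per_s: float) -> float:
    """Hours needed to stream size_tb once at a sustained throughput.

    Ignores compression, incremental backups, and contention with
    query and load traffic, so treat the result as a rough floor.
    """
    if throughput_gb_per_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = (size_tb * BYTES_PER_TB) / (throughput_gb_per_s * BYTES_PER_GB)
    return seconds / SECONDS_PER_HOUR


if __name__ == "__main__":
    # Hypothetical sustained rate of 1 GB/s for the sizes named in question 7.
    for size_tb in (300, 800):
        hours = full_copy_hours(size_tb, 1.0)
        print(f"{size_tb}TB at 1 GB/s: {hours:.1f} hours ({hours / 24:.1f} days)")
```

Under those assumptions, 300TB takes roughly 83 hours and 800TB roughly 222 hours (over nine days) for a single pass, which is part of why the backup-versus-replication question comes up at all.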

Love to hear your thoughts,
Dan L
DanL@DanLinstedt.com


Posted September 25, 2008 12:15 PM

1 Comment

VLDB/VLDW Expected Issues

Do you have the responses to these questions? I am working on various VLDB.. thanks!

Thanks,
Kata
