

Blog: Dan E. Linstedt

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I cover everything from DW 2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, unstructured data, DW and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University, where I serve on an academic advisory board for master's students in IT. I can't wait to hear from you in the comments of my blog entries. Thank you, and all the best; Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor for The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW 2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI/CMMI Level 5, and is the inventor of the Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, trained hundreds of industry professionals, and is the voice of Bill Inmon's blog at http://www.b-eye-network.com/blogs/linstedt/.

There's a lot of talk in the industry today about VLDW/VLDB (very large data warehouses and very large databases), and how too much data might not be such a good thing. I take a different view on this subject. In this blog entry I hope to explore the following questions: What is a VLDW/VLDB? What are some of the problems with it? What kinds of ROI multipliers might I find in a big data set?

I've recently had discussions with a major credit card processor, and as a result I'll share with you some of the common issues they face daily.

VLDW/VLDB is often defined simply as "big data." Does that mean we have a 1TB, 10TB, or 100TB data store sitting there? No. If the data is just sitting there and is not used for business purposes, then by all means it shouldn't be stored on-line (due to cost), or the business may not be looking at its information hard enough, or with the right questions, to use all of the data.

Something to think about: data mining has become a viable way of providing analytics, trend analysis, and forecasting above and beyond traditional statistics. In other words, companies with an extreme competitive advantage are using data mining to discover things about their business that they didn't previously know, or to predict future outcomes with a confidence rating that enables business decisions that make sense.

Having big data and using it are two different things. If you use 80% or better of your big data sets, then you have a VLDB or a VLDW. The base definition of "big data" means different things to different people. Someone sitting at 500GB might think "big" is 2TB. Someone at 2TB might think "big" is 8TB or 10TB, and so on. Instead of trying to define big data, I'll discuss the different levels of change that happen within terabyte-sized data sets.

Ranges:
500GB - 2TB
2TB - 5TB
5TB - 10TB
10TB - 50TB
50TB - 100TB
100TB - 200TB
200TB - 1PB+

The ranges are a rough guide. Things change within each range: data models, disk layouts, CPU-to-disk ratios, network speeds, node sizes, large SMP boxes vs. small MPP vs. clusters, queries, indexing, constraints, and so on. In other words: what works at 2TB doesn't work at 5-6TB, what works at 6TB won't work at 20TB, and so on. Of course, there are some hardware vendors out there who provide so much horsepower that these ranges don't apply, and as "data warehousing appliances" become more commonplace, they will handle most of these issues for us under the covers. But for now, assuming we are on existing systems, this is something to think about.
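To make the tiers a bit more concrete, here is a minimal Python sketch that simply maps a data volume onto the ranges above; the function name and tier labels are purely illustrative, not part of any product or methodology.

# Rough sketch: map a data volume (in terabytes) to one of the size
# tiers listed above. Boundaries mirror the ranges in this post; the
# function and label names are illustrative only.

SIZE_TIERS = [
    (2,   "500GB - 2TB"),
    (5,   "2TB - 5TB"),
    (10,  "5TB - 10TB"),
    (50,  "10TB - 50TB"),
    (100, "50TB - 100TB"),
    (200, "100TB - 200TB"),
]

def size_tier(volume_tb: float) -> str:
    """Return the rough size tier for a data set measured in terabytes."""
    for upper_bound_tb, label in SIZE_TIERS:
        if volume_tb <= upper_bound_tb:
            return label
    return "200TB - 1PB+"

print(size_tier(6))    # -> "5TB - 10TB"
print(size_tier(120))  # -> "100TB - 200TB"

Each boundary is simply where, in my experience, the architecture conversation changes; the code is just a convenient restatement of the list.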

What are some of the problems with VLDB/VLDW?
When systems reach "live" data usage of 20TB to 100+TB, they experience everything from physical performance breakdowns to server crashes. The problems we have with 500GB of data seem small and are easy to overcome, but those minute problems become very large problems when the data set grows beyond about 15TB.

List of potential problems (assuming large SMP boxes):
* Data modeling breakdown: queries across joined tables no longer work at all, no matter how much RDBMS parallelism you throw at them.
* Indexing breakdown: no matter what you ask of the optimizer, there's just no improvement in query times, even with partitioning of the underlying data set.
* Backup and restore no longer finish within the desired time frames, and in some cases it is nearly impossible to back up and/or restore entire data sets. Disaster recovery is HIGH RISK! (See the back-of-envelope backup-window sketch after this list.)
* Traditional data-over-disk layouts stop working completely, and in fact become a drag on performance.
* Replication systems choke on bottlenecked I/O and networks.
* Maintaining distributed data centers becomes a six-month project just to architect how it's going to work, and then there are negotiations for 24x7x365 bandwidth to keep the data flowing.
* At about 50TB, the cost of maintenance, machines, cooling systems, power grids, and IT resources begins to increase by a multiplier of 5x.
* Beyond about 50TB, there are no "canned standards" that companies can follow for a successful VLDB/VLDW.
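On the backup-and-restore point, a back-of-envelope Python sketch shows why the windows stop fitting; the 400 MB/s sustained throughput and the 8-hour window are assumptions for illustration, not measured numbers from any particular system.

# Back-of-envelope: how long a full backup takes at a sustained
# throughput, and whether it fits a nightly window. The 400 MB/s rate
# and the 8-hour window are assumptions for illustration only.

def backup_hours(data_tb: float, throughput_mb_per_s: float = 400.0) -> float:
    """Hours needed to stream data_tb terabytes at a sustained rate."""
    total_mb = data_tb * 1024 * 1024          # TB -> MB
    return total_mb / throughput_mb_per_s / 3600.0

for size_tb in (0.5, 5, 20, 50, 120):
    hours = backup_hours(size_tb)
    verdict = "fits" if hours <= 8.0 else "exceeds"
    print(f"{size_tb:>6} TB -> {hours:6.1f} h ({verdict} an 8-hour window)")

Somewhere between 5TB and 20TB the full backup blows past a typical nightly window under these assumptions, which is exactly where the disaster-recovery risk starts to climb.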

As far as mitigation strategies go, relying on experts who have built and architected systems at these sizes is paramount. Architecture is everything in these systems; without long-term architecture and forward thinking, the systems experience growing pains at around 20TB to 48TB, and then the company must call an all-engines-stop and rebuild from the ground up (very costly), or migrate to a new platform (which can also be very costly).

Denormalization is one mitigation strategy that will help, but only in certain cases. Remember that denormalizing data sets will instantly double or triple the storage requirements. Here's a fallacy for you: "storage is cheap." NOT SO at big-data levels. If you buy cheap storage, you get poor performance or a lack of parallelism. Furthermore, the more performance you want to drive out of a VLDB/VLDW, the more storage you may actually need.
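To sanity-check the double-or-triple claim, here is a hedged Python sketch; the row counts, row widths, and copied attribute widths are assumptions picked only to show how the arithmetic works.

# Illustrative sketch of why denormalization inflates storage: when a
# dimension's attributes are copied onto every fact row, the fact table
# grows by the copied width times the row count. All numbers below are
# assumptions for illustration only.

def storage_tb(rows: int, row_bytes: int) -> float:
    """Storage for a single table, in terabytes (1 TB = 1024**4 bytes)."""
    return rows * row_bytes / 1024 ** 4

# e.g. 20 billion transactions at 200 bytes per row, plus 400 bytes of
# copied customer/merchant attributes per row once denormalized:
normalized   = storage_tb(20_000_000_000, 200)
denormalized = storage_tb(20_000_000_000, 200 + 400)

print(f"normalized:   {normalized:5.1f} TB")
print(f"denormalized: {denormalized:5.1f} TB ({denormalized / normalized:.1f}x)")

Under those assumptions the single fact table triples, and that is before indexes, mirroring, and workspace are layered on top.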

So what about the data sets? Why can't we/shouldn't we reduce them?
I agree with the experts when they say: too much unused data is a bad thing. I disagree with experts when they say: too much bad or poor quality data is a bad thing.

There are two basic types of information in a VLDB/VLDW:
1. Good data (aggregated, cleansed, merged)
2. Transactional data (auditable/compliant/traceable)

The business users are divided into multiple groups: the 80%-90% who use the good data, or moderately good data (what counts as "good" is open to the end users' interpretation), and the 10%-20% who require transactional details.

In the good data set, there's no reason to keep around old or unwanted/unused data sets; they should be removed, or placed on a rolling usage cycle (a minimal sketch of what that might look like follows below). In the transactional data set (transactions with history), however, the data sits at the lowest possible grain, and the more data the better! Especially if the business is mining the data set, and/or has audit requirements or federal/international mandates stating it must be kept on-line.
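Here is a minimal sketch of what a rolling usage cycle might look like in Python; the 13-month retention window, the monthly partitions, and the example dates are all assumptions for illustration, not a description of any particular system.

# Minimal sketch of a rolling usage cycle for the "good" (aggregated)
# data set: anything older than the retention window is archived or
# dropped rather than kept on-line. The 13-month window and monthly
# partitions are assumptions for illustration only.

from datetime import date, timedelta

RETENTION_DAYS = 13 * 30   # roughly 13 months of aggregates kept on-line

def partitions_to_archive(partition_dates, today=None):
    """Return the partition dates that fall outside the retention window."""
    today = today or date.today()
    cutoff = today - timedelta(days=RETENTION_DAYS)
    return [d for d in partition_dates if d < cutoff]

# Example: monthly aggregate partitions from Jan 2004 through May 2005.
monthly = [date(2004, m, 1) for m in range(1, 13)] + [date(2005, m, 1) for m in range(1, 6)]
print(partitions_to_archive(monthly, today=date(2005, 5, 5)))

The returned dates would feed an archive/drop job against the aggregate tables only; the transactional history is deliberately left alone.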

Data mining loves big data: the more data it can mine, the better its predictions and confidence ratings. The less granular the detail it can mine, the worse its predictions are - you might as well go back to aggregates and standard statistics. In this case, the credit-card processing company also has SLAs with its vendors, along with the need to detect fraudulent activity - they MUST (and do) use a data mining tool on the transactional historical data.

With all these headaches, why build a VLDW? Why not just go back to old-style analytics backed by aggregations, averages, and statistics? Won't that save cost?
Yes, it will save on cost - but here's what you can gain if you build one. This company has six months of transactional history on-line, dynamically accessible by end users. That equates to about 120TB. They are seeing at least a 5x multiplier on ROI (compared with the costs of maintaining and supporting it). They mentioned that if they could keep 12 months of data on-line, they would do it in a heartbeat, and their multiplier would go up to 15x or higher.
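As a back-of-envelope reading of those figures: the 5x and estimated 15x multipliers are theirs; the annual cost number in the sketch below is purely an assumption to show how the arithmetic plays out.

# Back-of-envelope reading of the ROI figures above. The 5x and 15x
# multipliers are quoted; the annual cost is a pure assumption used
# only to show how the arithmetic plays out.

assumed_annual_cost = 10_000_000   # hypothetical cost to keep ~120TB on-line for a year

def gross_and_net(roi_multiplier, annual_cost):
    """Return (gross return, net gain) for a given ROI multiplier."""
    gross = roi_multiplier * annual_cost
    return gross, gross - annual_cost

for months, multiplier in ((6, 5), (12, 15)):
    # A 12-month history would roughly double the volume (and the cost);
    # cost is held constant here purely to keep the illustration simple.
    gross, net = gross_and_net(multiplier, assumed_annual_cost)
    print(f"{months:>2} months on-line: ~${gross:,} return, ~${net:,} net of cost")

Even with the cost simplification, the gap between 5x and 15x is why they would keep a full year on-line "in a heartbeat" if they could.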

The reason? Without the additional history, they are missing enough data to significantly impact their decision-making capabilities, especially with the data mining engine. In this game, the business must spend a little to gain a lot - especially if it knows what questions to ask and has a firm grasp on how the answers will make it more effective and more competitive.

There's more, a lot more - I discuss the details in my class, along with mitigation strategies. I'd love to meet with you at TDWI in DC (May 19, 2005) should you wish to drop by. See you next time.


Posted May 5, 2005 4:34 AM
