Blog: Dan E. Linstedt

Big Data = Big Problems = Huge ROI if done right.

There's a lot of talk in the industry today about VLDW/VLDB (very large data warehouses and databases, i.e. very large data sets), and how too much data might not be such a good thing. I take a different opinion on this subject. In this blog I hope to explore the following questions: What is VLDW/VLDB? What are some of the problems with it? What kinds of ROI multipliers might I find in a big-data set? I've recently had discussions with a major credit card processor, and as a result will share with you some of the common issues that they face daily.

VLDW/VLDB is defined to be big data. Does that mean we have a 1TB, 10TB, or 100TB data store sitting there? No. If the data is just sitting there and is not used for business purposes, then either it shouldn't be stored on-line (due to cost), or the business is not looking at its information hard enough, or with the right questions, to use all the data.

Something to think about: Data Mining has begun to be a viable solution for providing analytics, trend analysis, and forecasting above and beyond traditional statistics. In other words, companies with extreme competitive advantage are using Data Mining to discover things about their business that they didn't previously know, or to predict future outcomes with a confidence rating that enables business decisions that make sense.

Having big data and using it are two different things. If you use 80% or better of your big-data sets, then you have a VLDB or a VLDW. The base definition of Big Data means different things to different people. Someone sitting at 500MB might think "big" is 2TB. Someone at 2TB might think "big" is 8TB or 10TB, and so on. Instead of trying to define big data, I'll discuss the different levels of changes that happen within terabyte-sized data sets.

Ranges: The ranges are defined as a rough guide.
Things change within each range: data models, disk layouts, CPU-to-disk ratio, network speed, node sizes, large SMP boxes vs. small MPP vs. clusters, queries, indexing, constraints, and so on. In other words: what works at 2TB doesn't work at 5-6TB. What works at 6TB won't work at 20TB, and so on. Of course there are some hardware vendors out there who provide so much horsepower that these ranges don't apply, and as they progress and "data warehousing appliances" become more commonplace, they will handle most of these issues for us under the covers. But for now, assuming we are on existing systems, this is something to think about.

What are some of the problems with VLDB/VLDW?

List of potential problems (assuming large SMP boxes):

As far as mitigation strategies go, relying on experts, or on those who have built and architected systems of these sizes, is paramount. Architecture is everything in these systems. Without long-term architecture and forward thinking, the systems experience growing pains at around 20TB to 48TB, and then the company must either call an all-engines-stop and rebuild from the ground up (very costly), or migrate to a new platform (which can also be very costly).

Denormalization is one mitigation strategy that will help, but only in certain cases. Remember that denormalization of data sets will instantly double or triple the storage requirements. Here's a fallacy for you: storage is cheap. NOT SO at big data levels. If you buy cheap storage, you get poor performance or a lack of parallelism. Furthermore, the more performance you want to drive out of a VLDB/VLDW, the more storage you may actually need.

So what about the data sets? Why can't we/shouldn't we reduce them?
There are two basic types of information in VLDB/VLDW:

The business users are divided into user groups: the 80%-90% who use the good data, or moderately good data (what counts as good data is open to the end user's interpretation), and the 10%-20% who require transactional details. In the good data set, there's no reason to keep around "old" or unwanted/unused data sets. They should be removed, or placed on a rolling usage cycle.

However, the transactional data set (transactional with history) is at the lowest possible grain. The more data the better! This is especially true if the business is mining the data set, and/or has audit requirements or federal/international mandates that state it must be kept on-line. Data mining loves big data: the more data it can mine, the better its predictions and confidence ratings. The less granular detail it can mine, the worse its predictions are, and you might as well go back to aggregates and standard statistics. In this case, the credit-card processing company also has SLAs with its vendors, along with the need to detect fraudulent activity. They MUST (and do) use a data mining tool on the transactional historical data.

With all these headaches, why build a VLDW? Why not just go back to the old-style analytics backed with aggregations, averages, and statistics? Won't that save cost?

The reason? Those approaches leave out enough data to significantly impact decision-making capabilities, especially with the data mining engine. In this game, the business must spend a little to gain a lot, especially if they know what questions to ask and have a firm grasp on how the answers will make them more effective and more competitive.

There's more, a lot more. I discuss the details in my class, along with mitigation strategies. I'd love to meet with you at TDWI in DC (May 19th, 2005) should you wish to drop by. See you next time.