The question has been argued for the past two decades: is more data better? Do I really need more data? Where on earth is all this data coming from? How do I manage the ever-growing data sets? Does more data mean better business decisions? How can I reconcile these monstrous data sets? And so on... You've heard by now (I'm sure) many different folks in the industry offer their valued opinions. We can stand up on our feet and say: I'm on the fence - because half the time I hear it's the quality of the data that matters, and the other half of the time I have to defend the auditability and traceability of the data set in my warehouse. We can also stand on the fence because we can now "mine" for bad data patterns (but only if we are collecting them), and learn where our mistakes are.
This is the never-ending story of a data set... (sorry, I'm punchy this morning). Really, it's a fantasy land... Ok - time to get real.
Where is all this data coming from?
It's coming from unstructured, semi-structured, and more granular data feeds. It's coming in on web services and off the web-scraping of alternate sites. It's coming in from providers, suppliers, customers, consumers, builders... You get the picture.
Why do I need all this darned data?
How about:
* Compliance, accountability
* Metrics
* Business Discovery of "what's going wrong and when"
What about data quality? Doesn't it reduce the data set and improve my decision making abilities?
I've said it before, and I'll say it again: data quality (or better put, Information Quality) definitely reduces the data set, and absolutely allows a smoother, better flow toward more accurate business decisions. But it does something else: it HIDES the broken business processes, and it HIDES the problem areas in the source collection mechanisms which are COSTING YOU MONEY and time.
Well if I let all this "bad" data into my warehouse, won't I overwhelm the users? Won't I paralyze my abilities to make good decisions?
No. Absolutely not. Take a look at DW2.0 and the stack. Make your Data Warehouse (data integration store) a system of record: a granular, accountable data store integrated by horizontal business keys. Then produce marts in two classifications: error marts and data marts. Separate the BAD data from the GOOD data at the mart loading level. There are only a few power users who can take the bad data and find out WHY it's bad. This is the discovery process that keeps the business from hemorrhaging money at the source system or data provider level. The rest of the user base should access the GOOD data in the data mart; that's where the cleansed data should be put.
Use the data warehouse and the traceability of the data in the warehouse to "mine" the bad data for patterns - then use business discovery to find out why it's bad, and how much it's costing the enterprise. I think the dollar figures you can save may astound you.
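To make the split at the mart loading level concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the rule names, the row fields (`customer_key`, `source`, `amount`) and the `load_marts` function are hypothetical, not part of DW2.0 or any product. The point is only that the warehouse keeps every row, and the routing decision happens when the marts are loaded.

```python
from dataclasses import dataclass, field

@dataclass
class MartLoad:
    """Result of one mart load: good rows and bad rows, kept apart."""
    data_mart: list = field(default_factory=list)
    error_mart: list = field(default_factory=list)

# Hypothetical business rules; each returns True when the row is BAD.
RULES = {
    "missing_customer_key": lambda row: not row.get("customer_key"),
    "negative_amount": lambda row: row.get("amount", 0) < 0,
}

def load_marts(warehouse_rows):
    """Route each warehouse row to the data mart or the error mart.

    The warehouse rows themselves are never altered or dropped;
    only the marts are filtered.
    """
    load = MartLoad()
    for row in warehouse_rows:
        failed = [name for name, is_bad in RULES.items() if is_bad(row)]
        if failed:
            # Keep the reasons with the row so power users can ask WHY.
            load.error_mart.append({**row, "failed_rules": failed})
        else:
            load.data_mart.append(row)
    return load

# Made-up sample rows standing in for the granular warehouse.
warehouse_rows = [
    {"customer_key": "C100", "source": "web", "amount": 25.0},
    {"customer_key": "", "source": "supplier_feed", "amount": 40.0},
    {"customer_key": "C200", "source": "supplier_feed", "amount": -5.0},
]

load = load_marts(warehouse_rows)
print(len(load.data_mart), "good rows,", len(load.error_mart), "bad rows")
```

The general user base would only ever query `data_mart`; the handful of power users doing discovery would work from `error_mart`, where each row carries the rules it failed.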
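As a rough illustration of what "mining" the bad data for patterns could look like, the sketch below groups rejected rows by source and failed rule, and adds up the dollar amounts involved. The row layout (`source`, `failed_rules`, `amount`) and the sample values are made up for the example, and summing absolute amounts is just one simple stand-in for "how much it's costing"; a real cost model would come from business discovery.

```python
from collections import Counter

# Made-up error mart rows; field names are assumptions for this sketch.
error_mart = [
    {"source": "supplier_feed", "failed_rules": ["missing_customer_key"], "amount": 40.0},
    {"source": "supplier_feed", "failed_rules": ["missing_customer_key"], "amount": 15.0},
    {"source": "supplier_feed", "failed_rules": ["negative_amount"], "amount": -5.0},
    {"source": "web", "failed_rules": ["missing_customer_key"], "amount": 10.0},
]

def bad_data_patterns(rows):
    """Count bad rows and sum dollar exposure per (source, failed rule)."""
    counts = Counter()
    exposure = Counter()
    for row in rows:
        for rule in row["failed_rules"]:
            key = (row["source"], rule)
            counts[key] += 1
            exposure[key] += abs(row["amount"])
    return counts, exposure

counts, exposure = bad_data_patterns(error_mart)

# List the most frequent pattern first: that is where discovery should start.
for (source, rule), n in counts.most_common():
    print(f"{source:15} {rule:22} rows={n} exposure=${exposure[(source, rule)]:.2f}")
```

A pattern that keeps pointing at the same source and the same rule is a strong hint that a business process upstream is broken, rather than that the data is randomly dirty.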
Keep in mind I am NOT advocating release of this "bad" data to the general end-user base. Rather, it's a different kind of BI - one that is used to watch the metrics of business activity management and business process improvement. As the data improves, one can actually (quantitatively) see the impact of business changes in the source data providers. The dollar cost can be measured, thus you've reached Level 4 (Quantitatively Managed) of CMMI. From there, you can OPTIMIZE your business processes, and again, quantitatively measure the results as the bad data "subsides" from being loaded into your data warehouse.
Take control of your business, stop spending ruthlessly, understand the critical path of business processing - a path to true enlightenment... (not really, I just threw that in there for fun).
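One simple way to watch the bad data "subside" is to track the share of bad rows per load over time. The sketch below assumes you already have good and bad row counts for each load; the periods and counts are invented for illustration, and `bad_data_rate` is a hypothetical helper, not a standard metric.

```python
def bad_data_rate(good_count, bad_count):
    """Fraction of loaded rows that landed in the error mart."""
    total = good_count + bad_count
    return bad_count / total if total else 0.0

# Invented numbers: (load period, good rows, bad rows).
loads = [
    ("load 1", 9000, 1000),
    ("load 2", 9400, 600),
    ("load 3", 9800, 200),
]

rates = [(period, bad_data_rate(good, bad)) for period, good, bad in loads]

# True when every load has a lower bad-data rate than the one before it.
improving = all(later[1] < earlier[1] for earlier, later in zip(rates, rates[1:]))

for period, rate in rates:
    print(f"{period}: {rate:.1%} bad")
print("improving:", improving)
```

Pair a trend like this with the dollar figures from discovery and you have a quantitative measure of whether a process change at the source actually worked.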
Is more data really better?
Yes - but it depends on what you do with it, and how you separate it into two major classes: "Today’s GOOD data, i.e. today’s version of the truth", and "Today’s BAD data, i.e. the data that doesn't FIT in today’s version of the truth." Bad data patterns can expose really broken business processes, even from a historical standpoint.
Do I really need more data? Where on earth is all this data coming from?
No, we never need MORE data (although for data mining, the more granular the data set, and the more of it, the better the mining algorithms can predict things you're looking for). On the other hand, we're being forced to use more data: Compliance, Unstructured, Semi-Structured, Web, and Web Services are all contributing data to this integration vision.
* Note: Master Data As A Service is something that may help "consolidate" across organizations. For a prime example of customer MDaaS, look at the company Acxiom.
How do I manage the ever-growing data sets?
Good question - ARCHITECTURE, ARCHITECTURE, ARCHITECTURE - it's all in the models we use. If the models aren't scalable, flexible, repeatable and consistent in their design, they will fail - but then again, it's impossible to reach the holy grail of the perfect architecture all the time. Flexibility and standardization are key to making this work. The data model architecture is the crux of success for this kind of effort. You can read more about this on my web site (The Data Vault Data Modeling Architecture) at: http://www.DanLinstedt.com
This also requires proper hardware sizing, performance and tuning of applications to make it work.
Does more data mean better business decisions?
Maybe not externally facing business decisions, but internally facing business decisions about optimizing business processes - absolutely.
How do I reconcile these monstrous data sets?
It takes a couple of months, and some dedicated power users on the business side. The business drivers must be there to cut costs, cut delivery time, improve product/service quality, or drive to CMMI Level 5 across the business - otherwise there's no way to pay for the time it takes to reconcile the data sets and find out why broken data is coming in from source systems in the first place.
I specialize in big-data systems, big-data problems, and business intelligence. Feel free to comment, or contact me directly.
Thank you,
Dan Linstedt
CTO, Myers-Holum, Inc
http://www.MyersHolum.com