Blog: Dan E. Linstedt

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, unstructured data, DW and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University, where I serve on an academic advisory board for master's students in I.T. I can't wait to hear from you in the comments of my blog entries. Thank you, and all the best; Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author >

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor for The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI / CMMI Level 5, and is the inventor of the Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, has trained hundreds of industry professionals, and is the voice of Bill Inmon's blog on http://www.b-eye-network.com/blogs/linstedt/.

In my most recent blog entry I discussed temperature-related data sets; near the bottom I raised a number of questions about large-scale data sets and dirty data. Let's pick up where I left off...

One thing is clear: as we march forward, our data sets will only grow, not shrink. Something to take note of is: what exactly is "garbage data," and what does it mean? Can you identify it and remove it from your systems without impacting audits? If you clear it out, are you removing the possibility of tying together or discovering a meaningful relationship across your business that you didn't have before? If you have garbage data, does it mean your business is hemorrhaging money?

Data storage is growing, and our requirement to "keep it all" is growing, not shrinking. But is it justified?
In two words: compliance and auditability. With those, we can justify keeping nearly ALL data in the enterprise integrated data store (it's no longer just a data warehouse). But is that enough to keep us out of trouble?

Maybe. Let's take a look at some of the other things causing our data sets to grow:
* Incorporation of Unstructured data, along with pointers to that unstructured data
* Incorporation of data mining activities into our daily lives (mining is no longer just for the passive business looking to discover something about itself; it is now a daily or even near-real-time activity to detect and prevent fraud and abuse).
* Consolidation of massive global enterprise data collection efforts - for a variety of reasons, including safety, redundancy, and consistency.
* Re-introduction of master data sets and Master Data Management (the idea was around 30 years ago, but hardware kept consolidation at arm's length).
* Collection and attachment of Master Metadata to tie meaning and definition to all the data and processes within the Integrated Data Store
* The realization that "DIRTY DATA" can tell us as much about what we are doing wrong as "CLEAN DATA" can tell us about what we are doing right.

That's right, folks: dirty data or garbage data can no longer simply be washed away, tossed aside, or removed from the system. A fully integrated view of the enterprise means I have a view that tells me what I'm doing WRONG as much as it tells me what I'm doing RIGHT. If we are ever to get to a point where we can treat DATA AS AN ASSET or undergo asset valuation, we must be able to quantify and qualify the following (a small profiling sketch follows the list):

1. What is generating dirty data?
2. Where is it coming from?
3. Is it (the data) consistently repeatable? (Are the patterns the same?)
4. Are there patterns in the dirty data that lead to FRAUD?
5. Does the dirty data lead to BROKEN BUSINESS PROCESSES?
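To make those questions concrete, here is a minimal profiling sketch in Python. Everything in it is hypothetical - the staged records, the source-system names, and the two quality rules - but it shows the shape of the exercise: count dirty records by source (questions 1 and 2) and check whether the same violation patterns keep repeating (question 3).

from collections import Counter

# Hypothetical staged records; in practice these would come from your staging
# area or integrated data store, already tagged with their source system.
staged_records = [
    {"source": "ORDER_ENTRY",    "customer_id": "C-1001", "postal_code": "80202"},
    {"source": "ORDER_ENTRY",    "customer_id": "",       "postal_code": "80202"},
    {"source": "LEGACY_BILLING", "customer_id": "C-1002", "postal_code": "UNKNOWN"},
    {"source": "LEGACY_BILLING", "customer_id": "",       "postal_code": "UNKNOWN"},
    {"source": "CRM",            "customer_id": "C-1003", "postal_code": "80301"},
]

def dirty_reasons(record):
    """Return the list of (illustrative) quality rules this record violates."""
    reasons = []
    if not record["customer_id"]:
        reasons.append("missing customer_id")
    if not record["postal_code"].isdigit():
        reasons.append("non-numeric postal_code")
    return reasons

by_source = Counter()   # questions 1 and 2: what generates it, where does it come from?
by_pattern = Counter()  # question 3: do the same violation patterns repeat?

for rec in staged_records:
    reasons = dirty_reasons(rec)
    if reasons:
        by_source[rec["source"]] += 1
        by_pattern[(rec["source"], tuple(reasons))] += 1

print("Dirty records by source:", dict(by_source))
print("Repeating violation patterns:", dict(by_pattern))

Patterns that repeat for a given source are the ones worth tracing back to a broken business process; the one-off anomalies are far less interesting.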

Ok and the KICKER: Why must I (business owner) be accountable?
Well, maybe you want to be accountable and the business has tied your hands? Maybe you need to be accountable for compliance reasons; maybe you want to meet your metrics on the BAM side of the house; maybe you are looking to improve the rate of return on business processing; maybe you want to lower overhead costs and deliver higher-quality products to the market faster...

The dirty data tells the story. Mining dirty data and understanding its existence over time is just as important as consolidating, collecting, and producing high-quality, cleansed data for the business (the data used to run the business in day-to-day operations). Dirty data can point out where the business is hemorrhaging money, or perhaps suffering losses due to unseen fraud.

The moral of the story?
Just because it's dirty doesn't mean it should be thrown out, deleted, or removed from the integrated data store. It should be understood, and business users should be held accountable for the production of the dirty data in the first place. They own the business; therefore they own the dirty data and the processes that produced it. I estimate that if a business undertook a single project to answer the questions above, it could cut operating costs by at least 10% within 3 to 6 months, and that's being conservative.

When dirty data begins to get cleaned up at the source system, at the business processes, and by business users, morale usually improves, excitement builds, new projects are granted, and productivity gains are seen. WHY? Because business users finally feel empowered to take charge and make the system work the way it always should have. Business users see an opportunity to spend less time worrying about working around dirty data and more time using "correct" data downstream. They finally feel justified in fixing 25-year-old systems. I was there; I was in an environment where this happened. For the first time in 15 years, we saw more change requests to the source system than had ever been issued in the past.

As the data cycled back through the organization and improved in quality, management took notice of the newly "agreeing" numbers coming through the business reports from the enterprise data integration store. They applauded the efforts, and overhauled business processes from the top down. It was refreshing.

You said that dirty data causes hemorrhaging of money... how does this work?
OK, hidden costs. For example, let's take a look at all the work it takes to put Humpty Dumpty back together again (assuming he's the dirty data)...
1. The source system captures dirty data and tries to reconcile it - it makes changes to the data set (and if it's legacy, those are 15-year-old business rules), breaks compliance, and adjusts the keys so that the record is force-fed into the system to match whatever "record" the logic says it matches.
2. It then pushes this modified record and its assumed parent downstream, possibly to multiple systems - infecting (like a virus, if you will) all systems that come in contact with this dirty data.
3. On the business side, the processors run their daily (operational) reports, pull the dirty data, and now must spend time looking up the right information from paper files - yes, paper files - or, in the case of young or small organizations, they scratch their heads and lose the data because they can't do anything with it. Maybe they spend some time trying to correct it manually, everything from making phone calls to sending emails to the client.
4. Finally the business gets a handle on the dirty data and annotates the corrections. WHERE? In a Word doc or in unstructured emails, but NEVER back in the source system where the data originated. Now the data is spread out like a flashlight beam: the further from the source, the dirtier it gets.
5. When someone tries to build the deliverable or assemble the services that were ordered, they often deliver the wrong thing, don't meet specs, and so on - causing the customer to come back upset, or come back multiple times to get the error corrected.
6. The dirty data is then pulled into the enterprise warehouse, or enterprise integration store - from here it gets tagged with "today's version of the truth" error-processing rules, and often ends up getting kicked out (see the reject-logging sketch after this list). Now the business users need to spend more time going back over the error reports and ALL the emails they sent to correct the problem in the first place; then and only then - if they feel this is a repeat customer, or if the money from that customer is great enough - will they correct it in the source system.
7. The rest of the integration stream tries its best to further consolidate the dirty data through artificial-intelligence rules and statistical rules to get it into conformed dimensions and altered facts. By now the beam of light has been spread so far that it is only a dim glow on the wall of a very dark room.
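As a minimal sketch of how step 6 could be handled differently, assume the warehouse load writes every kicked-out record to a reject log that carries its source system and natural key. The file name, field names, and the example record below are hypothetical, not part of any particular toolset; the point is simply that the correction can then be routed back to the system that produced the record instead of living in emails and Word documents.

import csv
import datetime
import os

# Hypothetical reject log for records kicked out of the warehouse load.
# Capturing the source system and natural key keeps the lineage intact,
# so the fix can be routed back to the originating system (not to email).
REJECT_LOG = "dirty_data_rejects.csv"
FIELDS = ["rejected_at", "source_system", "source_key", "rule_violated", "raw_record"]

def log_reject(source_system, source_key, rule_violated, raw_record):
    """Append one rejected record, with its origin, to the reject log."""
    write_header = not os.path.exists(REJECT_LOG) or os.path.getsize(REJECT_LOG) == 0
    with open(REJECT_LOG, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow({
            "rejected_at": datetime.datetime.now().isoformat(timespec="seconds"),
            "source_system": source_system,
            "source_key": source_key,
            "rule_violated": rule_violated,
            "raw_record": repr(raw_record),
        })

# Example: a record rejected because its dates fail a basic sanity rule.
log_reject(
    source_system="LEGACY_BILLING",
    source_key="INV-042",
    rule_violated="order_date later than ship_date",
    raw_record={"order_date": "2006-09-01", "ship_date": "2006-08-15"},
)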

This cycle starts all over again when new dirty data is entered, costing the business more time and more money (on an exponential curve). It gets tougher and tougher to fix as business rules change. IF A BUSINESS WANTS OR NEEDS TO BE NIMBLE, it must fix the SOURCE of these problems, but in order to do that, it must undertake a dirty data expedition (project). My good friend Larry English speaks of these things in his TQM courses.

So, in your organization, where does dirty data end up? Out with the garbage? Or is it addressed the way it should be: as an organizational or business-process problem?

Stop hemorrhaging money! Learn how to capture the dirty data and get it fixed now, so that over time there is less and less of it. Don't just assume that the haystack is all bad and the data is throwaway; that assumption will lead to bad business decisions in the future.

My firm has a set of industry best practices that we are currently employing at large government, financial, and travel agencies where we can show the return on investment for cleaning up dirty data.

Thoughts? Comments?
Dan Linstedt
CTO, Myers-Holum, Inc


Posted August 20, 2006 5:48 AM

1 Comment

Dan,

I work in a domain where good data is not mandated at the source system (not a technical limitation, but a limitation of the hospitality industry). We have millions of dirty records, for which we have data quality/data deduplication processes running across the system. I agree with your point, however, that the business must take responsibility for the bad data they get from their warehouse, and target correction of the root cause rather than invest in workarounds and elaborate IT infrastructure to tie the data elements together at a later stage.


    