Blog: Dan E. Linstedt

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, unstructured data, DW, and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University, where I serve on an academic advisory board for Masters students in I.T. I can't wait to hear from you in the comments of my blog entries. Thank you, and all the best; Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author >

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor for The Data Warehousing Institute, a featured speaker at industry events, and a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI/CMMI Level 5, and is the inventor of the Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, trained hundreds of industry professionals, and is the voice of Bill Inmon's blog on http://www.b-eye-network.com/blogs/linstedt/.

I just blogged on the need to let dirty data flow through to an auditable reporting area. That raises a lot of questions about WHERE the dirty data should reside and where it should be cleaned up. In this entry we'll take a short, consolidated look at exactly that. If you are fighting with compliance and auditability at a systems or data level, hopefully this will be helpful.

In Nirvana land, there would never be dirty data. Our source systems would be perfect and would capture only clean data, and if the data wasn't clean, they would go to great lengths to get it that way before releasing it to the rest of the corporation. Of course, in Nirvana land we would have a SINGLE STACK information system: one operational system and one data store holding both the current/transactional data and the historical/warehouse-like data.

In Nirvana land the single system would integrate with web services external to our company. In Nirvana land, this "single system" would really be the be-all-end-all, dare I say it: an appliance/platform, pre-configured with all that you need, everything from DSS to source-system OLTP, edit checks, information quality, reporting, and dashboards. In Nirvana land, it would all be... well... nirvana.

But alas, this is not the case. In the real world, the data warehousing industry has for years taught or preached that data quality should sit between the source system and whatever the next "step" of integration is: an ODS, an EDW, an ADW, a Kimball bus staging area. From a strategic point of view this may work nicely, as long as whatever tools are being used to change and alter the data produce the correct audit trail.

However, from a tactical or operational point of view, or from a point of view where we are trying our best to "catch" broken source systems, it doesn't work so well. Cleansing the data or using data quality to produce a "single version of the truth" can hurt our ability (as business users) to be accountable and actually get the source systems fixed. That said: where oh where do we put the IQ systems?

Between Sources and ODS?
Between Sources and Data Warehouses?
Between ODS and Data Warehouses?
Between Data Warehouses and Data Marts?

And where oh where do we put the "bad" records? Do we kick them back to the source system? Do we report on them separately for end users to clean up? Do we leave them in a daily-snapshot set of error tables? Do we let them flow through our ODS and data warehouse environments to end up in separate data mart areas?

Remember that TRUTH is subjective, and no one can say whose truth it is, or when. Auditors, on the other hand, have a very different definition of truth: they try to take an objective view, one that is traceable and auditable.

Putting DQ or IQ between the sources and the ODS allows master data to be built within the ODS. The caveats are that the bad data, the pre-altered data trails, and any audit trails produced by the integration or "data alteration" stream MUST be warehoused somewhere over time. Simply "improving the quality of the data" without showing the errors to the business causes all kinds of heartburn downstream. Kicking bad records back to the source system is fine (for a while), until more code must be hand-written on the source system to take it as a new feed, which drives maintenance costs up through the roof. Having quality, cleansed data in the ODS allows great interaction from a strategic standpoint, and possibly allows transactional feeds back to the source systems; ultimately, though, it greatly increases the complexity of the solution that handles "BAD DATA" upstream.
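To make that caveat concrete, here is a minimal sketch (mine, not a reference implementation; every table, column, and rule name is hypothetical) of a source-to-ODS load that keeps the pre-altered row and an audit trail instead of silently "improving" the data:

```python
from datetime import datetime, timezone

def cleanse_phone(raw):
    """Example cleansing rule: keep digits only."""
    return "".join(ch for ch in (raw or "") if ch.isdigit())

def load_to_ods(source_row, ods, audit, raw_archive):
    """Cleanse on the way to the ODS, but warehouse the dirt and the audit trail too."""
    cleansed = dict(source_row)
    cleansed["phone"] = cleanse_phone(source_row.get("phone"))

    raw_archive.append(dict(source_row))   # the pre-altered row, kept over time
    ods.append(cleansed)                   # the quality-checked row the ODS sees
    if cleansed["phone"] != source_row.get("phone"):
        audit.append({                     # what was altered, from what, and when
            "business_key": source_row["customer_id"],
            "column": "phone",
            "before": source_row.get("phone"),
            "after": cleansed["phone"],
            "altered_at": datetime.now(timezone.utc).isoformat(),
        })

ods_rows, audit_rows, raw_rows = [], [], []
load_to_ods({"customer_id": 42, "phone": "(802) 555-0101"}, ods_rows, audit_rows, raw_rows)
```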

Putting DQ or IQ between the sources and the warehouse can (and often does) cause auditability problems. Not many ETL or ELT tool sets have the capacity to preserve the data "as of before it was changed/aggregated/filtered"; that must be architected, and it increases the difficulty of the load routines, not to mention slowing down the overall load cycles. Furthermore, bad data must be captured in "error tables," which usually live in the staging area; sometimes these are refreshed only daily, so no analytics can be run on those data sets because they disappear too fast. As we move towards real-time or near-real-time processing, there is NO TIME to process, integrate, and aggregate the data (either between sources and ODS or between sources and your DW). Real-time dictates that when the data is available, good, bad, or indifferent, it be loaded to your warehouse AS IS, with no delays; otherwise the data decays and becomes useless, or the fraud you are trying to detect is cleansed away, cleaned up, and never detected.
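As an illustration of that "load AS IS" point, here is a minimal sketch, again with made-up names and an in-memory list standing in for the warehouse: the row lands immediately, tagged with whatever quality findings were observed, instead of being held back for cleansing or dropped into a daily-refresh error table.

```python
from datetime import datetime, timezone

def quality_findings(row):
    """Record what is wrong with the row without changing it."""
    findings = []
    if not row.get("customer_id"):
        findings.append("missing business key")
    if row.get("amount", 0) < 0:
        findings.append("negative amount")
    return findings

def load_as_is(row, warehouse):
    warehouse.append({
        **row,
        "dq_findings": quality_findings(row),                # keep the dirt visible for analysis
        "load_dts": datetime.now(timezone.utc).isoformat(),  # when it arrived, not when it was fixed
    })

warehouse_rows = []
load_as_is({"customer_id": None, "amount": -125.00}, warehouse_rows)
```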

Putting DQ or IQ between the warehouse and the data marts is, in my opinion, probably the best way to handle this. The data warehouse then becomes a statement of integrated facts (integrated by business key), and the marts become specific versions of the truth from which management can begin to dissect where they are losing money.
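A minimal sketch of that placement, with purely hypothetical names: the warehouse keeps every row as received, and each mart applies its own business rules on the way out, so two marts can carry two different "versions of the truth" over the same auditable facts.

```python
def build_finance_mart(warehouse_rows):
    """Apply this mart's version of the truth without altering the warehouse."""
    mart = []
    for row in warehouse_rows:
        if row.get("customer_id") is None:
            continue                                 # this mart's rule: exclude rows missing the key
        mart.append({
            "customer_id": row["customer_id"],
            "amount": max(row.get("amount", 0), 0),  # this mart's rule: floor negative amounts at zero
        })
    return mart

# Another mart can read the same warehouse rows with different rules,
# while the raw, auditable history upstream stays untouched.
```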

This all applies when loading "today's" data; in fact, cleansing with audit trails before reaching either the ODS or the DW applies when loading "today's" data as well. What doesn't fly is the data model. Once the data model has been constructed for "cleansed" data, it very rarely allows broken data (or error data) in. In other words, loading historical data wreaks havoc on the data model, or the dirty historical data gets cleansed, and through that process the evidence of broken business processes and broken business rules is washed away as well.

Now, is there value to cleaning up the source systems? You bet; this is the ultimate goal. Nirvana. Is there value to having cleansed data in your ODS? Absolutely! You get the idea: there is definite value in having cleansed, quality-checked data, but there is also value in running analytics on the dirty data. When did it break the business rule (back in history), and why? What was the pattern? Then the real work begins: what caused the pattern? Is the pattern still happening today? Is the business losing money or opportunities with customers because the source system is still broken? And so on.
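For example, and only as a rough sketch with illustrative rows and a made-up rule, the question above ("is it still breaking today?") can be answered by trending rule violations over time instead of deleting them:

```python
from collections import Counter

def violation_trend(warehouse_rows, rule):
    """Count rule violations per load month to see when the breakage started and whether it persists."""
    trend = Counter()
    for row in warehouse_rows:
        if not rule(row):
            trend[row["load_dts"][:7]] += 1    # bucket by YYYY-MM
    return dict(sorted(trend.items()))

rows = [
    {"load_dts": "2006-05-03", "amount": -10},
    {"load_dts": "2006-06-11", "amount": 250},
    {"load_dts": "2006-08-20", "amount": -75},
]
print(violation_trend(rows, rule=lambda r: r["amount"] >= 0))
# {'2006-05': 1, '2006-08': 1} -> the source system is still producing bad rows
```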

So next time you think about where to PUT your DQ/IQ solution, consider the impact of NOT looking at the dirty data, and ask the question: instead of kicking the data back to the source system, is there somewhere we can warehouse this information to examine its patterns? I think you'll be surprised at all the business accountability you'll uncover.

Love to hear your thoughts on this.

Dan Linstedt
CTO
Myers-Holum, Inc


Posted August 29, 2006 6:37 AM

2 Comments

I think you need to put your data quality handling in the place where the heaviest transformation and conversion occurs, and this is usually from sources to ODS. This step already has what I call row leakage: the drop-off of rows that could not conform to the job code or the target metadata schema and were unexpectedly dropped out of the job. So you already have data quality exceptions at this layer; why not handle the deliberate data quality exceptions here as well?

I tried defining an ETL data quality firewall in this step, where you check the quality of a row at the end of transformation and decide whether to keep or drop it. There are four outcomes from the firewall: unexpected rejection, failed QA and rejected, failed QA and kept, and passed QA and kept. All messages from the firewall go into a data quality messaging table on the ODS, and these messages can be linked to exception tables or to ODS tables. So you can keep your dirty data and drill down to the data quality messages for that row.
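A rough sketch of what I mean, with placeholder names and rules (just to illustrate the idea, not production code):

```python
from datetime import datetime, timezone

def firewall(row, qa_checks, ods_rows, exception_rows, dq_messages):
    """Classify a transformed row into one of the four outcomes and log a DQ message for it."""
    try:
        failures = [name for name, check in qa_checks if not check(row)]
    except Exception as exc:                      # the row could not even be evaluated
        outcome = "unexpected rejection"
        exception_rows.append(row)
        failures = [str(exc)]
    else:
        if not failures:
            outcome = "passed QA and kept"
            ods_rows.append(row)
        elif row.get("amount") is not None:       # placeholder rule: keep rows we can still use
            outcome = "failed QA and kept"
            ods_rows.append(row)
        else:
            outcome = "failed QA and rejected"
            exception_rows.append(row)
    dq_messages.append({                          # messaging table row, linkable to ODS/exception rows
        "business_key": row.get("customer_id"),
        "outcome": outcome,
        "failures": failures,
        "logged_at": datetime.now(timezone.utc).isoformat(),
    })
    return outcome

ods, exceptions, messages = [], [], []
checks = [("non_negative_amount", lambda r: r["amount"] >= 0)]
firewall({"customer_id": 7, "amount": -3.0}, checks, ods, exceptions, messages)
```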

regards
Vincent

Hi Vincent,

Thank you for your comment. I hold a vastly different view today than I did years ago, when EDW meant all strategic data.

Putting DQ in the place where the heaviest transformation and conversion occurs can be a severe bottleneck to reaching a real-time or right-time solution. That is, if your right-time delivery window is less than five minutes, and the burst rate of arriving transactions exceeds the queue-wait time needed to fully cleanse the data.

ETL is getting squeezed for the same reasons, and ELT is becoming the norm, where the "T" is transformation in a lazy mode. A data quality firewall is an interesting idea, but I'm not sure it's one I would subscribe to. I've got several other postings in my blog on the business value of data, in which I mention how much money the business "found" when they began cleaning up their errors.

Bottom line: we want the source systems to have clean data (nirvana), which will never happen.

Interesting thoughts though, thanks for the comments.
Dan L
