I just blogged on the need to allow dirty data to flow through to an auditable reporting area. That raises a lot of questions about WHERE the dirty data should reside, and where it should be cleaned up. In this entry we'll take a short, consolidated look at those questions. If you are fighting with compliance and auditability at a systems or data level, hopefully this will be helpful.
In Nirvana land, there would never be dirty data. Our source systems would be perfect and would capture only clean data. And if the data wasn't clean, they would go to great lengths to get it that way before releasing it to the rest of the corporation. Of course, in Nirvana land we would have a SINGLE STACK information system: one operational system, and one data store holding both the current/transactional and the historical/warehouse-like data.
In Nirvana land, the single system would integrate with web services external to our company. In Nirvana land, this "single system" would really be the be-all-end-all - dare I say it: an Appliance / Platform, pre-configured with all that you need, everything from DSS to source-system OLTP, edit checks, information quality, reporting, and dashboards. In Nirvana land, it would all be... well... nirvana.
But alas, this is not the case. In the real world, the data warehousing industry has for years taught (or preached) that data quality should sit between the source system and whatever the next "step" of integration is - an ODS, an EDW, an ADW, or a Kimball Bus staging area. From a strategic point of view this may work nicely, as long as whatever tools are used to change and alter the data produce the correct audit trail.
However, from a tactical or operational point of view - or when we are trying our best to "catch" broken source systems - it doesn't work so well. Cleansing the data, or using data quality to produce a "single version of the truth," can hurt our ability (as business users) to be accountable and actually get the source systems fixed. That said: where oh-where do we put the IQ systems?
Between Sources and ODS?
Between Sources and Data Warehouses?
Between ODS and Data Warehouses?
Between Data Warehouses and Data Marts?
And where oh-where do we put the "bad" records? Do we kick them back to the source system? Do we report on them separately for end-users to clean up? Do we leave them in a daily-snapshot set of error tables? Do we let them flow through our ODS and data warehouse environments, to end up in separate data mart areas?
Remember that TRUTH is subjective; no one can say whose truth it is, or when. Auditors, meanwhile, have a very different definition of truth: they try to take an objective view - one that is traceable and auditable.
Putting DQ or IQ between the sources and the ODS allows Master Data to be built within the ODS. The caveat: the bad data, the pre-alteration data trails, and any audit trails produced by the integration or "data alteration" stream MUST be warehoused somewhere over time. Simply "improving the quality of the data" without showing the errors to the business causes all kinds of heartburn downstream. Kicking bad records back to the source system is fine (for a while), until more code must be hand-written on the source system to accept it as a new feed - which drives maintenance costs up through the roof. Having quality, cleansed data in the ODS allows great interaction from a strategic standpoint, and possibly allows transactional feeds back to the source systems. Ultimately, though, it greatly increases the complexity of the solution that must handle "BAD DATA" up-stream.
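To make the "warehouse the bad data AND the audit trail" idea concrete, here is a minimal sketch of my own (not from any particular toolset - the table names, the country rule, and the `??` sentinel are all illustrative assumptions). Every inbound record lands in the ODS, every broken rule lands in an error table, and every alteration is logged with its before-value:

```python
# Illustrative sketch: cleanse into the ODS, but never throw the evidence away.
# Table names, the country rule, and the "??" sentinel are assumptions.
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ods_customer   (cust_key TEXT, country TEXT);
    CREATE TABLE dq_error       (cust_key TEXT, rule TEXT, raw_value TEXT, load_ts TEXT);
    CREATE TABLE dq_audit_trail (cust_key TEXT, field TEXT, before_val TEXT,
                                 after_val TEXT, load_ts TEXT);
""")

VALID_COUNTRIES = {"US", "CA", "MX"}  # assumed reference data for the example

def load_record(rec):
    """Cleanse one record into the ODS, warehousing errors and the audit trail."""
    ts = datetime.now(timezone.utc).isoformat()
    country = (rec["country"] or "").strip().upper()
    if country not in VALID_COUNTRIES:
        # Record the broken rule instead of kicking the row back to the source.
        conn.execute("INSERT INTO dq_error VALUES (?,?,?,?)",
                     (rec["cust_key"], "invalid_country", rec["country"], ts))
        country = "??"  # sentinel so the row still lands in the ODS
    if country != rec["country"]:
        # Audit trail: the value as of BEFORE it was changed.
        conn.execute("INSERT INTO dq_audit_trail VALUES (?,?,?,?,?)",
                     (rec["cust_key"], "country", rec["country"], country, ts))
    conn.execute("INSERT INTO ods_customer VALUES (?,?)", (rec["cust_key"], country))

for rec in [{"cust_key": "C1", "country": "us"},
            {"cust_key": "C2", "country": "Narnia"}]:
    load_record(rec)
```

The point of the sketch is the shape, not the rule: both rows reach the ODS, the "Narnia" violation survives in `dq_error` for downstream analytics, and `dq_audit_trail` shows every pre-alteration value - which is exactly what the auditors will ask for.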
Putting DQ or IQ between the sources and the warehouse can (and often does) cause auditability problems. Few of these ETL or ELT tool sets can produce the data "as of before it was changed/aggregated/filtered"; that capability must be architected in, which increases the difficulty of the load routines - not to mention slows down the overall load cycles. Furthermore, bad data must be captured in "error tables," usually in the staging area, and these are often only a daily refresh - no analytics can be run on those data sets; they disappear too fast. And as we move toward real-time or near-real-time processing, there is NO TIME to process, integrate, and aggregate the data (either between the sources and the ODS, or between the sources and your DW). Real-time dictates that when the data is available - good, bad, or indifferent - it be loaded to your warehouse AS IS, with no delays. Otherwise the data decays and becomes useless, or the fraud we are attempting to detect is cleansed away, cleaned up, and never detected.
Putting DQ or IQ between the warehouse and the data marts is probably (in my opinion) the best way to handle this. The data warehouse then becomes a statement of integrated facts (integrated by business key), and the marts become specific versions of the truth from which management can begin to dissect where they are losing money.
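One way to picture this arrangement - and this is a sketch under my own assumed names (`dw_sales`, `mart_finance`, `mart_audit`) and toy rules, not a prescribed design - is that the warehouse keeps every fact exactly as it arrived, while each mart is just a view applying its own version of the truth:

```python
# Illustrative sketch: warehouse = integrated facts, marts = versions of truth.
# All table/view names and rules here are assumptions for the example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dw_sales (business_key TEXT, amount REAL, region TEXT);
    INSERT INTO dw_sales VALUES ('BK1', 100.0, 'EAST'),
                                ('BK2', -40.0, 'EAST'),  -- broken rule, kept!
                                ('BK3',  75.0,  NULL);   -- broken rule, kept!

    -- Finance mart: its "version of the truth" excludes the broken rows...
    CREATE VIEW mart_finance AS
        SELECT * FROM dw_sales WHERE amount >= 0 AND region IS NOT NULL;

    -- ...while an audit mart deliberately surfaces them.
    CREATE VIEW mart_audit AS
        SELECT * FROM dw_sales WHERE amount < 0 OR region IS NULL;
""")
```

Because the cleansing lives in the mart views, nothing is lost: finance sees clean numbers, auditors see the broken business rules, and both trace back to the same integrated facts.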
This all applies when loading "today's" data - and in fact, cleansing with audit trails before the data reaches either the ODS or the DW also only works for "today's" data. What doesn't fly is the data model: once the data model has been constructed for "cleansed" data, it very rarely allows broken (or error) data in. In other words, loading historical data wreaks havoc on the data model - or the dirty historical data is cleansed, and through that process the evidence of the broken business processes and broken business rules is washed away as well.
Now, is there value in cleaning up the source systems? You bet - this is the ultimate goal. Nirvana. Is there value in having cleansed data in your ODS? Absolutely! You get the idea: there's definite value in having cleansed, quality-checked data. But there's also value in running analytics on the dirty data: when did it break the business rule (back in history)? Why? What was the pattern? Then the real work begins: what caused the pattern? Is the pattern still happening today? Is the business losing money or opportunities with customers because the source system is still broken? And so on.
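Those questions become simple queries once the violations are warehoused instead of kicked back. A toy sketch, with a hard-coded sample in place of a real persistent error table (the months, rule names, and data are invented for illustration):

```python
# Illustrative sketch: analytics on the dirty-data history itself.
# The (month, rule) rows stand in for a persistent error table.
from collections import Counter

error_history = [
    ("2006-05", "missing_ship_date"),
    ("2006-06", "missing_ship_date"),
    ("2006-06", "missing_ship_date"),
    ("2006-07", "negative_amount"),
    ("2006-08", "missing_ship_date"),
]

# When did this business rule start breaking, and is it still breaking?
by_month = Counter(month for month, rule in error_history
                   if rule == "missing_ship_date")
first_seen = min(by_month)               # earliest month the rule broke
still_breaking = "2006-08" in by_month   # is it still happening "today"?
```

With a daily-refresh error table none of this is answerable - the history disappears before anyone can ask when the pattern started.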
So next time you think about where to PUT your DQ/IQ solution, consider the impact of NOT looking at the dirty data. Instead of kicking the data back to the source system, ask the question: is there somewhere we can warehouse this information to examine its patterns? I think you'll be surprised at all the business accountability you'll uncover.
Love to hear your thoughts on this.
Posted August 29, 2006 6:37 AM