I just finished reading an interesting article in the latest TDWI newsletter (dated today)titled "Data Cleansing versus Auditability: Achieving Competitive Advantage in a Cautious Era" It was written by Michael Mansur, Simon Terr, and David Stanvick from HP Services' Information Management Practice. It got me to thinking about the notion that there may be times when we destroy the usability of data by fixing it up too much...
If you are interested in reading their article, click here and sign up for the newsletter. It is worth reading.
I have always heard and I am sure you have too the data warehousing mantra that we must clean up the data flowing into the warehouse as much as possible before turning the users loose on it. Most of the time this is true because it inspires confidence in the business community by correcting many of the errors created by the operational systems that could impact analytics.
However, the point made in the article is that data cleansing also pretty much destroys the audit trail for a piece of data. So in this era of regulations, compliance issues, security and privacy breaches, losing the audit trail of a piece of data can be a serious problem. There should be a way to balance the need to cleanse the data to achieve the goals of the users while preserving the original values in case of an audit.
Let me give you an interesting example of when quality processes destroy the value of the data. Several years ago, I worked with a credit card company on their data warehouse. They stored all the transactions from credit card users for quite a long period of time, making them available to analysts to slice and dice, do fraud models, perform regression tests, and perform other analytics.
All was going well until I ran into a lady who was responsible for determining when the credit card readers were beginning to pot out. I guess the card swipes don't just suddenly die -- they slowly lose their ability to "read" the magnetic strip on your card. What happens is that they begin sending in some of the data but not all. And the missing bits change with every swipe.
The lady told me that she did her own extractions against the operational database that collected all these transactions rather than use the data warehouse data. I was puzzled why she would go to all that trouble to redundantly do what others were already doing. I decided to pursue it more.
Turns out that the implementation team was really efficient at creating integration and quality processes -- so good that, by the time the transactions hit the data warehouse, all the missing or partial data had been overwritten with the correct fields. The data that would indicate a reader going bad was completely gone... This was a case of too much or inappropriately placed data quality processes.
The article in TDWI's newsletter suggests four options to overcome this problem:
1. Flat-file archive - which they say is the simplest technique. You capture all the operational data extracted into a flat file and archive it somewhere. Downside is of course that this is not a real DBMS so analytics and comparisons get to be difficult.
2. Minimal duplicates - you retain both the original data and its fixed up counterpart through the ETL layer to the presentation layer. In other words the original data is stored in its own redundant column. Not as elegant and certainly not as simple as the first solution but it does have some advantages since the data is actually in the warehouse database and readily available. However, redundancy comes with its own set of problems in a growing warehouse.
3. ETL Express - the authors say this is a variation of the flat file idea. You use the same flat file technique with the addition of a "straight through" or ETL express process that can be run on the source data when needed. The express jobs populate special audit tables, each having duplicated primary keys tied directly to the cleaned up warehouse data. Redundant data is brought into the warehouse only when it's needed for audit purposes. Again somewhat limiting and complex but better than options 1 or 2 in most cases.
4. ABC Tables - The last option is an implicit approach which involves creating audit, balance and control (ABC) tables. These tables are used for more than just auditing since they support the balancing and control functions as well. Rather than explicitly storing the original and corrected data values and the history of changes, this option "makes business rules available so source and intermediate values of data can be constructed from the final target values". Obviously this is a complex process and may not suit your audit purposes exactly.
My advice is to do a thorough study of the audit needs in your enterprise and then craft the option (or combination of options) that supports your requirements AND is feasible in your technological environment. It may not be simple but, in the long run, you will create an environment that provides everyone with a satisfactory BI environment -- even the auditors!
Yours in BI success,
Claudia
Posted June 28, 2007 10:57 AM
Permalink | 2 Comments |



