Blog: Claudia Imhoff

Claudia Imhoff

Welcome to my blog.

This is another means for me to communicate, educate, and participate within the Business Intelligence industry. It is a perfect forum for airing opinions, thoughts, vendor and client updates, problems, and questions. To maximize the blog's value, it must be a participative venue, so I look forward to hearing from you often; your input is vital to the blog's success. All I ask is that you treat me, the blog, and everyone who uses it with respect.

So...check it out every week to see what is new and exciting in our ever-changing BI world.

About the author

A thought leader, visionary, and practitioner, Claudia Imhoff, Ph.D., is an internationally recognized expert on analytics, business intelligence, and the architectures to support these initiatives. Dr. Imhoff has co-authored five books on these subjects and writes articles (totaling more than 150) for technical and business magazines.

She is also the Founder of the Boulder BI Brain Trust, a consortium of independent analysts and consultants (www.BBBT.us). You can follow them on Twitter at #BBBT.

Editor's Note:
More articles and resources are available in Claudia's BeyeNETWORK Expert Channel. Be sure to visit today!

I just finished reading an interesting article in the latest TDWI newsletter (dated today) titled "Data Cleansing versus Auditability: Achieving Competitive Advantage in a Cautious Era." It was written by Michael Mansur, Simon Terr, and David Stanvick from HP Services' Information Management Practice. It got me thinking that there may be times when we destroy the usability of data by fixing it up too much...

If you are interested in reading their article, click here and sign up for the newsletter. It is worth reading.

I have always heard (and I am sure you have, too) the data warehousing mantra that we must clean up the data flowing into the warehouse as much as possible before turning the users loose on it. Most of the time this is true: correcting many of the errors created by the operational systems that could impact analytics inspires confidence in the business community.

However, the point made in the article is that data cleansing also pretty much destroys the audit trail for a piece of data. So in this era of regulations, compliance issues, security and privacy breaches, losing the audit trail of a piece of data can be a serious problem. There should be a way to balance the need to cleanse the data to achieve the goals of the users while preserving the original values in case of an audit.

Let me give you an interesting example of when quality processes destroy the value of the data. Several years ago, I worked with a credit card company on their data warehouse. They stored all the transactions from credit card users for quite a long period of time, making them available to analysts to slice and dice, build fraud models, run regression tests, and perform other analytics.

All was going well until I ran into a lady who was responsible for determining when the credit card readers were beginning to wear out. I guess the readers don't just suddenly die -- they slowly lose their ability to "read" the magnetic stripe on your card. What happens is that they begin sending in some of the data but not all, and the missing bits change with every swipe.

The lady told me that she ran her own extractions against the operational database that collected all these transactions rather than use the data warehouse. I was puzzled as to why she would go to all that trouble to redundantly do what others were already doing, so I decided to pursue it further.

Turns out that the implementation team was really efficient at creating integration and quality processes -- so efficient that, by the time the transactions hit the data warehouse, all the missing or partial data had been overwritten with the correct field values. The data that would indicate a reader going bad was completely gone... This was a case of too many, or inappropriately placed, data quality processes.
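To make that concrete, here is a minimal sketch (in Python, with hypothetical field and reader names of my own) of the kind of check her analysis depends on. It only works against the raw, uncleansed feed, because it counts how often each reader sends an incomplete swipe:

```python
from collections import defaultdict

# Hypothetical raw swipe records as they arrive from the card readers,
# before any cleansing; a failing reader drops fields intermittently.
REQUIRED_FIELDS = ["card_number", "expiry", "amount", "merchant_id"]

def incomplete_swipe_rate(raw_swipes):
    """Return the share of swipes per reader_id that arrived with missing fields."""
    totals = defaultdict(int)
    incomplete = defaultdict(int)
    for swipe in raw_swipes:
        reader = swipe["reader_id"]
        totals[reader] += 1
        if any(swipe.get(field) in (None, "") for field in REQUIRED_FIELDS):
            incomplete[reader] += 1
    return {reader: incomplete[reader] / totals[reader] for reader in totals}

# Toy data: reader "R-17" is starting to fail and drops a different field each swipe.
swipes = [
    {"reader_id": "R-01", "card_number": "4111111111111111", "expiry": "09/09",
     "amount": 42.50, "merchant_id": "M1"},
    {"reader_id": "R-17", "card_number": "4111111111111111", "expiry": None,
     "amount": 10.00, "merchant_id": "M2"},
    {"reader_id": "R-17", "card_number": None, "expiry": "01/10",
     "amount": 99.99, "merchant_id": "M2"},
]

for reader, rate in sorted(incomplete_swipe_rate(swipes).items()):
    print(f"{reader}: {rate:.0%} incomplete swipes")
```

Once the quality processes fill in or overwrite those missing fields, the signal disappears -- which is exactly what happened in the warehouse described above.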

The article in TDWI's newsletter suggests four options to overcome this problem:

1. Flat-file archive - which they say is the simplest technique. You capture all the extracted operational data into a flat file and archive it somewhere. The downside, of course, is that a flat file is not a real DBMS, so analytics and comparisons against it get to be difficult.

2. Minimal duplicates - you retain both the original data and its fixed-up counterpart all the way through the ETL layer to the presentation layer. In other words, the original data is stored in its own redundant column. Not as elegant and certainly not as simple as the first solution, but it does have some advantages since the original data is actually in the warehouse database and readily available (a sketch follows this list). However, redundancy comes with its own set of problems in a growing warehouse.

3. ETL Express - the authors say this is a variation of the flat-file idea. You use the same flat-file technique with the addition of a "straight through" or ETL express process that can be run on the archived source data when needed. The express jobs populate special audit tables, each having duplicated primary keys tied directly to the cleaned-up warehouse data, so redundant data is brought into the warehouse only when it is needed for audit purposes (also sketched after this list). Again, somewhat limiting and complex, but better than options 1 or 2 in most cases.

4. ABC Tables - the last option is an implicit approach that involves creating audit, balance, and control (ABC) tables. These tables are used for more than just auditing since they support the balancing and control functions as well. Rather than explicitly storing the original and corrected data values and the history of changes, this option "makes business rules available so source and intermediate values of data can be constructed from the final target values". Obviously this is a complex process and may not suit your audit purposes exactly.
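Here is a minimal sketch of option 2, in Python with SQLite; the table and column names are hypothetical, not taken from the article. Each cleansable attribute is carried in paired columns, so the original value survives in the same warehouse row as its cleansed counterpart:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical "minimal duplicates" fact table: every cleansable attribute is
# stored twice -- once as received from the source, once after cleansing.
conn.execute("""
    CREATE TABLE card_transactions (
        txn_id       INTEGER PRIMARY KEY,
        reader_id    TEXT,
        amount_orig  REAL,   -- value exactly as extracted from the source system
        amount_clean REAL,   -- value after the data quality rules were applied
        expiry_orig  TEXT,
        expiry_clean TEXT
    )
""")

# The ETL job writes both versions of each field into the same row.
conn.execute(
    "INSERT INTO card_transactions VALUES (?, ?, ?, ?, ?, ?)",
    (1001, "R-17", None, 10.00, "9/9", "09/09"),
)
conn.commit()

# Analysts query the *_clean columns; auditors can compare them with the
# *_orig columns without ever leaving the warehouse.
for row in conn.execute(
    "SELECT txn_id, amount_orig, amount_clean, expiry_orig, expiry_clean "
    "FROM card_transactions"
):
    print(row)
```

As noted above, the redundancy has its costs: every cleansable column is carried twice, which adds up as the warehouse grows.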
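Option 3 can be sketched in the same spirit, again with hypothetical names of my own: a "straight through" express job reloads the flat-file archive on demand into an audit table that shares the warehouse primary key, so original values are joined to the cleansed rows only when an audit calls for them.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")

# Cleansed warehouse table, populated by the normal ETL jobs.
conn.execute("CREATE TABLE fact_txn (txn_id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("INSERT INTO fact_txn VALUES (1001, 10.00)")

# Flat-file archive of the raw extract (an in-memory CSV stands in for it here).
raw_archive = io.StringIO("txn_id,amount\n1001,\n")

# "Straight through" express job: run only when auditors ask for the originals.
conn.execute("CREATE TABLE audit_txn (txn_id INTEGER PRIMARY KEY, amount_orig TEXT)")
for rec in csv.DictReader(raw_archive):
    conn.execute("INSERT INTO audit_txn VALUES (?, ?)", (rec["txn_id"], rec["amount"]))
conn.commit()

# Auditors join original and cleansed values on the shared primary key.
query = """
    SELECT f.txn_id, a.amount_orig, f.amount AS amount_clean
    FROM fact_txn f JOIN audit_txn a ON a.txn_id = f.txn_id
"""
for row in conn.execute(query):
    print(row)
```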

My advice is to do a thorough study of the audit needs in your enterprise and then craft the option (or combination of options) that supports your requirements AND is feasible in your technological environment. It may not be simple but, in the long run, you will create a BI environment that satisfies everyone -- even the auditors!

Yours in BI success,

Claudia


Posted June 28, 2007 10:57 AM

2 Comments

I think a lot of DW projects make the mistake of thinking some data quality cleansing on loads will give the DW data a gleam that it doesn't have in the dirty source systems. This is only half the work. Metrics on what was bad, and reports back to the source systems, are also needed.

The tricky part is getting project sponsors to pay for the extra development costs of the audit approaches you listed when the priority task is to just get the data into the database.

Hmmm, this is why companies need BUSINESS analysts involved in projects. When you focus on a single aspect of the business (cleaning up data), you often overlook the impacts/ramifications on other aspects of the business. A point I've tried to make for years and years.

And I'm not talking about the quasi-system engineering type of business analysis, with data dictionaries and process flows and all that. I'm talking about someone (or someones) with the responsibility of thinking through and examining business flows to evaluate and identify issues just like the one described above. Call them business quality circles, centers of excellence, whatever you want - but when a company has someone tasked with always looking at the big picture, things like this are less likely to happen.

Unfortunately, since the value of this kind of analysis lies in avoiding or preventing negative impacts, the case for it is often frustratingly difficult to make, especially in public corporations.
