Some Perspectives on Quality

Originally published July 23, 2009

When it gets right down to it, there are a lot of perspectives on the quality of data in a data warehouse. One perspective says that data should be corrected whenever it can be before it is ever entered into a data warehouse: when a value is found to be incorrect, it should be amended if at all possible before it is placed in the warehouse. We probably have data quality pioneer Larry English to thank for this perspective. But there are other schools of thought. Another school holds that data should not be corrected before entry into a data warehouse, even when it is known to be incorrect. We probably have Geoff Holloway to thank for this perspective.

Now these are two very different approaches to the same problem. So who is right and who is wrong? It turns out that they are both right. You just have to supply the proper context to tell which perspective applies.

In general, if data is known to be incorrect and there is a known way to correct – to “fix” – the data, then it makes sense to correct it if you can. In this general case, Larry English is right. It simply does not make sense to place incorrect data into a data warehouse if it is possible not to.

But is it true that data should always be corrected before entry into the data warehouse? It turns out that there are quite a few circumstances in which data should not, or cannot, be corrected prior to entry into the data warehouse.

For example, suppose that upon finding some incorrect data, there is no known way to correct it. The best you can do is to put a default value into the field that is known to be incorrect. But that really isn’t correcting the data. Or suppose that you want to provide a clerk with the best information possible in order to figure out how to actually make a correction. Geoff Holloway makes the point that keeping incorrect data, even when it is known to be incorrect, often provides valuable clues as to how the data needs to be corrected. When the incorrect data is thrown away, no such clues exist.

Or there is the case where there is a legal reason to keep incorrect data. Take loan applications. Suppose that a home loan application is made for a billion dollars. Even though no bank or lending institution makes home loans of this amount, and even though everyone – including the applicant – knows that this value is a mistake, there is a legal obligation to leave the loan application untouched. Only the originator of the loan application can correct the data, and even then the correction is made by filing a new application, not by reentering the old one.

There are, then, very legitimate circumstances where incorrect data is best left in the database or data warehouse. Stated differently, neither correcting data nor leaving it uncorrected is the right thing to do in every circumstance. In order to determine which approach is proper, the context of the corrections has to be known. Only then can it be determined whether correcting errors is the proper thing to do.
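In practice, one way to honor both perspectives is to land the value exactly as it was received alongside a cleansed value and a quality flag, so the original is never lost but summaries can still use a corrected or default figure. Here is a minimal sketch in Python; the record layout, field names and plausibility threshold are illustrative assumptions, not anything prescribed above.

```python
# Minimal sketch: preserve the value as received while also carrying a
# cleansed value and a quality flag, so neither perspective is lost.
# Field names and the threshold are illustrative, not from the article.
from dataclasses import dataclass
from typing import Optional

@dataclass
class LoanAmountFact:
    application_id: str
    amount_as_received: float          # kept untouched, even if known to be wrong
    amount_cleansed: Optional[float]   # corrected or default value, if one exists
    quality_flag: str                  # e.g. "OK", "SUSPECT"

def land_amount(application_id: str, raw_amount: float) -> LoanAmountFact:
    """Decide how to land a loan amount without discarding the original."""
    if raw_amount > 50_000_000:        # illustrative plausibility check
        # Keep the suspect value for audit and as a clue about the error,
        # but flag it so summaries can exclude or default it.
        return LoanAmountFact(application_id, raw_amount, None, "SUSPECT")
    return LoanAmountFact(application_id, raw_amount, raw_amount, "OK")

# Example: a billion-dollar home loan application is kept exactly as received.
print(land_amount("APP-1", 1_000_000_000.0))
```

The point of the flag is that nothing is thrown away: the suspect value remains available as a clue for whoever eventually makes the correction.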

But there are other perspectives.

Another perspective questions whether exactness and complete accuracy matter in a data warehouse. Is it possible to have some degree of inaccuracy in a data warehouse? The answer is that while accuracy is certainly desirable and certainly important, complete accuracy is not always needed in a data warehouse. Consider the following context: a manager wishes to find out how much monthly revenue there was across the United States for a product. The answer comes back that total monthly sales were $12,076,917. Now it turns out that a sale in Oklahoma for $102 was incorrectly recorded as a sale of $10.20. On the one hand, it is true that the information in the data warehouse is not perfectly accurate. On the other hand, does it matter? If there is an error of roughly $90 out of a total of $12,000,000, is the error going to influence any decision negatively? The answer is absolutely not.

But if it turns out that all sales in New Mexico, Arizona, Colorado and Utah have been recorded improperly, and that there is a differential of $500,000 out of $12,000,000, then there is every likelihood that this kind of error will in fact make a difference to decisions made from the calculation. Accuracy, then, is a relative issue in the face of the kinds of decisions being made with data from a data warehouse.
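A quick sketch makes the arithmetic explicit: an error matters only in proportion to the total it distorts. The one percent materiality threshold below is purely an illustrative assumption; real tolerances depend on the decisions being made.

```python
# Minimal sketch: whether an error matters depends on its size relative to
# the figure being reported. The 1% threshold is an assumption chosen for
# illustration, not a value from the article.
def error_is_material(reported_total: float, error_amount: float,
                      threshold: float = 0.01) -> bool:
    """Return True if the error is large enough to influence a decision."""
    return abs(error_amount) / abs(reported_total) >= threshold

# The Oklahoma sale: $102 recorded as $10.20, an error of about $92.
print(error_is_material(12_076_917, 102 - 10.20))   # False: ~0.0008% of the total
# The four-state problem: roughly $500,000 off on about $12,000,000 of sales.
print(error_is_material(12_076_917, 500_000))        # True: ~4% of the total
```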

Because data in the data warehouse is used in summary, there is some tolerance for error. Of course, the tolerance for error may be small. But there is nevertheless a tolerance for error.

It is only at the individual record or account level – when a system keeps track of individual checks, deposits, ATM activities and the like – that there is no tolerance for error.

So the notion of data quality takes some surprising twists when viewed in the context of data warehousing and decision support processing.

Bill Inmon

Bill is universally recognized as the father of the data warehouse. He has more than 36 years of database technology management experience and data warehouse design expertise. He has published more than 40 books and 1,000 articles on data warehousing and data management, and his books have been translated into nine languages. He is known globally for his data warehouse development seminars and has been a keynote speaker for many major computing associations.



Comments


Posted July 24, 2009 by Phil Bailey phil.bailey@btconnect.com

I agree with Karien, and also Bill, as one of the first benefits to be achieved from a data mart or warehouse is VISIBILITY of the data from your business apps... and if IT think they can or should be accountable for 'fixing' it, then I would argue that they should think again. IT are merely custodians of data, not the owners. Garbage in, garbage out - IT do not have a magic wand, and should not try to fix things they are not asked to fix by the business.

So George, surely your problem lies with your IT systems, although I appreciate that in complex, large-scale environments, getting small things like code value anomalies 'fixed' back in the source systems is difficult and you have to 'live with it' in the DW domain.

In my designs I try to provide a view of the RAW source data within a persistent version of the staging layer (building up history that may no longer be stored in the sources over time due to archiving, etc.), and then provide the conformed, standardised and 'fixed' view within data models - but making sure that it is possible to reconcile back to the raw layer to allow users to trust that the data is correct.

If you were to 'fix' a null value in a particular column, how do you then reconcile this back to the source? Your reported figures are instantly different from the same count from the source system, creating one of the gravest problems in MI - inconsistent figures - and instantly your DW has lost its credibility!
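A minimal sketch of the reconciliation check described in the comment above, assuming a raw persistent staging layer and a conformed layer held as simple Python lists; the table contents and column names are illustrative only.

```python
# Minimal sketch: the conformed layer should still tie back to the
# raw/persistent staging layer. Data here is illustrative, not real.
raw_staging = [
    {"customer_id": 1, "region": None,  "amount": 100.0},      # kept as received
    {"customer_id": 2, "region": "OK",  "amount": 10.20},
]
conformed = [
    {"customer_id": 1, "region": "UNKNOWN", "amount": 100.0},  # 'fixed' null
    {"customer_id": 2, "region": "OK",      "amount": 10.20},
]

def reconcile(raw: list, clean: list) -> dict:
    """Compare row counts and totals so users can trust the conformed view."""
    return {
        "row_count_matches": len(raw) == len(clean),
        "amount_total_matches": round(sum(r["amount"] for r in raw), 2)
                                == round(sum(c["amount"] for c in clean), 2),
    }

print(reconcile(raw_staging, conformed))
```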


Posted July 24, 2009 by Karien Verhagen

A very important issue in data quality is the business's awareness of the importance of registering correct data. This awareness is not served by hiding all kinds of correcting and cleaning algorithms in the ETL. Given the choice, it is far better to report incorrect values and improbabilities to the source data owners and let them correct them if appropriate. Source owners will develop awareness of their links with other lines of business and develop a corporate awareness that is very valuable and a lasting contribution to data quality in the corporate data warehouse.


Posted July 23, 2009 by george.allen@va.gov

I definitely fall into the Larry English camp. Data quality problems speak to poor programming practices at the front end application, allowing incorrect data to be entered beyond expected parameters or not allowing for verification at the point of entry. Correcting these problems at the point of moving the data to the warehouse is an exacting process, but it also allows for more flexibility and accuracy of the data within.

Of course, if one is only expecting aggregation to come out of the warehouse, then small errors are of no consequence; but if a small error prevents a whole set of data from being included in the aggregated result, say an improperly coded department number, then the errors become much more problematic.

I work in the healthcare arena, and have many data quality issues to deal with caused by too many data entry loopholes in the applications recording the information.  Though my warehouse is mainly used for aggregating care performance, we are now branching out to more detailed analysis of protocols and direct patient care costs, so we cannot allow for as many errors as we might with more aggregated reporting.

This is my personal opinion and not necessarily the opinion of the Department of Veterans Affairs.
