Originally published July 23, 2009
When it gets right down to it, there are a lot of perspectives on quality of data in a data warehouse. There is one perspective that says that data should be corrected whenever it can be corrected before it is ever entered into a data warehouse. When a value is found to be incorrect, the data should be amended if at all possible before it is placed in a data warehouse. We probably have data quality pioneer Larry English to thank for this perspective. But there are other schools of thought. There is another school of thought that holds that data should not be corrected before entry into a data warehouse, even when it is known that the data is incorrect. We probably have Geoff Holloway to thank for this perspective.
Now these are two very different approaches to the same problem. So who is right and who is wrong? It turns out that they are both right. You just have to supply the proper context to tell the difference between the two different perspectives.
In general, if data is known to be incorrect and if there is a known way to correct – to “fix” – the data, then it makes sense to correct the data if you can. Larry English is correct in this general case hypothesis. It simply does not make sense to place incorrect data into a data warehouse if it is possible not to.
But is it true that data should always be corrected before entry into the data warehouse? It turns out that there are quite a few circumstances when data should not or cannot be corrected prior to entry into the data warehouse. For example, suppose that upon finding some incorrect data, there is no known way to correct the data. The best you can do is to put a default value into the field that is known to be incorrect. But that really isn’t correcting the data. Or suppose that you want to provide a clerk with the best information possible in order to figure out how to actually make a correction. Geoff Holloway makes the point that keeping incorrect data even when it is known to be incorrect often provides valuable clues as to how data needs to be corrected. When the incorrect data is thrown away, no such clues exist. Or, there is the case where there is a legal reason to keep incorrect data. Take loan applications. Suppose that a home loan application is made for a billion dollars. Even though no bank or lending institution makes loans of this amount for a home and even though everyone – including the applicant – knows that this value is a mistake, there is a legal obligation to leave the loan application untouched even when it is known that the application contains incorrect or faulty data. Only the originator of the loan application can correct the data, and even then the correction is made by filing a new application, not by reentering the old application.
There are then very legitimate circumstances where incorrect data is best left in the database or data warehouse. Stated differently, there is no circumstance where correcting data or not correcting data is the right thing to do. In order to determine which approach is proper, the context of the corrections has to be known. Only then can it be determined whether correcting errors is the proper thing to do.
But there are other perspectives.
Another perspective is questioning whether or not exactness and complete accuracy matter in a data warehouse? Is it possible to have some degree of inaccuracy in a data warehouse? The answer is that while accuracy is certainly desirable and is certainly important, complete accuracy should not be needed in a data warehouse. Consider the following context – a manager wishes to find out how much monthly revenue was across the United States for a product. The answer comes back that there was total monthly sales of $12,076,917. Now it turns out that there was a sale in Oklahoma that was for $102 that was incorrectly recorded as a sale of $10.20. On the one hand, it is true that the information in the data warehouse is not perfectly accurate. On the other hand does it matter? If there was an error of $90 out of a total of $12,000,000, is the error going to influence any decision negatively? The answer is absolutely not.
But if it turns out that all sales in New Mexico, Arizona, Colorado and Utah have been recorded improperly, and if it turns out that there was a differential of $500,000 out of $12,000,000, then there is every likelihood that this kind of error will in fact make a difference to any decision made from the calculation. Accuracy then is a relative issue in the face of the kinds of decisions being made with data from a data warehouse.
Because data in the data warehouse is used in summary, there is some tolerance for error. Of course, the tolerance for error may be small. But there is nevertheless a tolerance for error.
It is only at the individual record or account level that there is no tolerance for error. It is when a system keeps track of individual records – checks, deposits, ATM activities and the like that there is no tolerance for error.
So the notion of data quality takes some surprising twists when viewed in the context of data warehousing and decision support processing.
Recent articles by Bill Inmon