I was just scanning Philip Russom's October 2007 monograph on "Unifying the Practices of Data Profiling, Integration, and Quality," and noticed this:
"The quality of data degrades as application update, add data to, or delete data from a database. Most estimates say that 10-12% of the data in an active database becomes dirty, nonstandard, or redundant each month. Hence, if you cleanse a database 100% today, it will only be 88-90% clean 30 days from now."
I love finding sentences like this, where a (hopefully objective) third party provides a hard statistic that can be used as ammunition for supporting a data quality initiative. On the other hand, I do get concerned when an unnamed source (the "most estimates" part, I mean) is used to provide the statistic.
Actually, what would happen if the data is never cleansed? How about for six months? Does the 10-12% degradation apply to all the records or only to the currently clean ones? OK, simple arithmetic: if each month 10% of the clean records become dirty, nonstandard, or redundant, then after 6 months (.9*.9*.9*.9*.9*.9) * 100%, or 53.1%, of the records are unsullied.
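For anyone who wants to play with the numbers, here is a quick Python sketch of that arithmetic. It assumes the simplest reading of the quote: a constant share of the currently clean records goes bad each month and nothing is ever re-cleansed. The function name `clean_fraction` is my own, not anything from the monograph.

```python
def clean_fraction(monthly_rate: float, months: int) -> float:
    """Fraction of records still clean after `months`.

    Assumption: each month a constant share (`monthly_rate`) of the
    records that are still clean becomes dirty, and no cleansing occurs.
    """
    return (1.0 - monthly_rate) ** months


# Both ends of the quoted 10-12% range, over six months.
for rate in (0.10, 0.12):
    print(f"{rate:.0%} monthly decay, 6 months: {clean_fraction(rate, 6):.1%} clean")
```

At the 10% rate this reproduces the 53.1% figure; the 12% end of the range comes out lower still.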
I am not sure that I believe this to be the case. Perhaps the rate at which data becomes dirty slows each month, since those records with a higher propensity to become flawed (for whatever reason: multiple touchpoints, commonly used records, etc.) will have already been subjected to an error, so they would not be counted again the second month?
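To make that hunch concrete, here is a rough Python sketch in which the records are split into groups with different monthly error rates. The split is entirely made up for illustration (20% "high-touch" records at a 40% monthly rate, 80% at 2.5%), chosen only so that the first month's overall degradation works out to 10%. The names `clean_fraction_mixed` and `HYPOTHETICAL_GROUPS` are mine.

```python
def clean_fraction_mixed(groups, months):
    """Clean fraction at month 0..months for a mix of record groups.

    `groups` is a list of (share_of_records, monthly_error_rate) pairs.
    Assumption: within each group, a constant share of the still-clean
    records goes bad each month, and nothing is re-cleansed.
    """
    return [
        sum(share * (1.0 - rate) ** month for share, rate in groups)
        for month in range(months + 1)
    ]


# Hypothetical mix, not taken from any study: first-month loss is
# 0.2 * 0.40 + 0.8 * 0.025 = 10% of all records.
HYPOTHETICAL_GROUPS = [(0.2, 0.40), (0.8, 0.025)]

history = clean_fraction_mixed(HYPOTHETICAL_GROUPS, 6)
for month in range(1, 7):
    # Share of last month's clean records that went bad this month.
    monthly_loss = 1.0 - history[month] / history[month - 1]
    print(f"month {month}: {history[month]:.1%} clean, "
          f"{monthly_loss:.1%} of clean records lost")
```

Under these invented numbers the monthly loss starts at 10% and shrinks every month after that, so the six-month clean fraction ends up well above the 0.9 ** 6 figure. It proves nothing about real databases, of course; it only shows that the slowdown is plausible if error-prone records are concentrated in a small group.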
If anyone has any references to this 10-12% number, post a comment with a link!
Posted July 15, 2009 1:06 PM