Blog: David Loshin

Dirty Data and Embedded Knowledge

Is it better to clean data on intake or after it has been processed? Let's say you have a data entry process in which names and addresses are input into a system. At some point within your processing, that same data (name and address) will be forwarded to an application performing a business process, such as printing a shipping label. However, there is no guarantee that the individual whose name and address were input will ever be sent anything. You want to maintain clean data, and you are now faced with two options: cleanse the data at intake or cleanse it when it is used.

There are arguments for each option. On the one hand, a number of data quality experts advocate ensuring that the data is clean when it enters your system, which would support the decision to cleanse the data at the intake location. On the other hand, since not all names and addresses input to the system are used, cleansing them all may turn out to be unnecessary work. Cleansing on use would instead limit your work to what the business process actually needs.

Here is a hybrid idea: cleanse the data to determine its standard form, but don't actually modify the input data. The reason is that a variation in a name or address provides extra knowledge about the individual: perhaps a nickname, or a variation in spelling that may occur in other situations. Reducing each occurrence of a variation to a single form removes knowledge associated with potential aliases, which ultimately reduces your global knowledge of the individual. But if you can determine that the input data is just a variation of one (or more) records that you already know, storing the entered version linked to its cleansed form will provide greater knowledge moving forward.
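
To make the hybrid idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a real cleansing engine: the `NICKNAMES` and `STREET_WORDS` lookup tables, the `standardize` function, and the `ContactStore` class are all hypothetical names invented for this example, and a production system would use a proper name and address standardization tool in their place. The point is only the storage pattern: the standard form is computed and used as a key, while every entry is kept exactly as it was typed.

```python
import re
from collections import defaultdict

# Hypothetical lookup tables, for illustration only.
NICKNAMES = {"BOB": "ROBERT", "BILL": "WILLIAM", "LIZ": "ELIZABETH"}
STREET_WORDS = {"ST": "STREET", "AVE": "AVENUE", "RD": "ROAD"}


def _tokens(text):
    # Drop punctuation, uppercase, and split on whitespace.
    return re.sub(r"[^\w\s]", "", text).upper().split()


def standardize(name, address):
    # Compute the standard form without touching the input strings.
    name_std = " ".join(NICKNAMES.get(t, t) for t in _tokens(name))
    addr_std = " ".join(STREET_WORDS.get(t, t) for t in _tokens(address))
    return name_std, addr_std


class ContactStore:
    def __init__(self):
        # standard form -> list of raw entries, exactly as entered
        self.variants = defaultdict(list)

    def add(self, name, address):
        key = standardize(name, address)
        raw = (name, address)
        if raw not in self.variants[key]:
            self.variants[key].append(raw)  # keep the variation, linked to its key
        return key

    def aliases(self, key):
        # Every distinct name spelling seen for this standard form.
        return sorted({name for name, _ in self.variants[key]})


store = ContactStore()
k1 = store.add("Bob Smith", "12 Main St.")
k2 = store.add("Robert Smith", "12 Main Street")
```

Both entries resolve to the same standard form, so `k1 == k2`, yet `store.aliases(k1)` still reports both "Bob Smith" and "Robert Smith". The shipping-label process can print from the standard form, while the nickname stays available for matching future input.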