Originally published December 11, 2008
The conventional wisdom suggests that “quality data” is defined in terms of “fitness for use.” This somewhat odd phrase is heavy with context, and it implies a significant amount of work before one can pragmatically stamp a data set with a quality seal of approval. To assess the quality of a data set, one must understand who the consumers of that data are, how the data set is used, what it is used for, and the data consumers’ levels of expectation. This suggests a top-down analysis phase in which the criteria for data quality management are derived from those consumers, their uses, and their expectations.
This approach should provide a way of defining data quality criteria that are clearly specified in terms of the downstream business data consumers. By understanding the “use,” the data analyst has a framework for defining “fitness” in relation to the way the data is used and what makes the data suitable for those purposes (as characterized by the levels of acceptability). You might call this “subjective data quality,” since the definition of quality sits within the context of the business process, and data values that are good for one purpose might be less than acceptable for others.
On the other hand, for certain kinds of data the values are either of high quality or they are not. For example, a deliverable mailing address either has the correct ZIP code or it doesn’t. We could state that a high-quality address must have a correct city, state and ZIP code no matter how it is used, so if the record has the correct ZIP code for the named city and state, that record is of high quality. In fact, almost anything we call reference data could be considered in terms of “objective data quality”: the values can be judged to be of high quality outside of any specific business context. Lookup tables, codes and descriptions all fall into this category, especially when they are used across many different applications in the organization. Corroboration of data values against reference data is particularly interesting, since it assumes the objective accuracy of the data in the lookup tables.
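As a rough sketch of that kind of objective check (the reference table, field names and sample rows here are hypothetical, not drawn from any particular product or data set), a record could be corroborated against a ZIP code lookup along these lines:

```python
# Hypothetical sketch: corroborating address records against a reference table.
# The reference data itself is assumed to be authoritative (ZIP, city, state).
ZIP_REFERENCE = {
    ("10001", "NEW YORK", "NY"),
    ("60601", "CHICAGO", "IL"),
    ("94103", "SAN FRANCISCO", "CA"),
}

def zip_is_consistent(record: dict) -> bool:
    """Return True if the record's ZIP code agrees with its city and state."""
    key = (
        record.get("zip", "").strip(),
        record.get("city", "").strip().upper(),
        record.get("state", "").strip().upper(),
    )
    return key in ZIP_REFERENCE

# This record passes; swapping the state to "NJ" would fail it.
print(zip_is_consistent({"zip": "10001", "city": "New York", "state": "NY"}))
```

Note that the check says nothing about how the address will be used; it is an objective judgment made entirely against the lookup table.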
Here are some tricky questions: At what point does one care more about subjective data quality than objective data quality? When can we use technology to empirically assess the levels of data quality? Let’s put the first question aside and come back to it in a bit. Of course, data profiling combines a set of techniques for analyzing data outside of any context, so for assessing the objective quality of the data, profiling is a good, albeit blunt, instrument. Most profilers are bottom-up tools that essentially provide column value statistics and frequency analysis, which helps highlight obvious potential issues such as duplication, unexpected value frequencies, distributions (or cardinality), outlier values (highest or lowest), missing values and the like.
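To make that concrete, a bare-bones column profile of the kind just described (missing values, cardinality, value frequencies, extremes) can be computed in a few lines. This is only an illustrative sketch; the use of pandas is my assumption for convenience, not a feature of any particular profiling tool:

```python
import pandas as pd

def profile_column(series: pd.Series) -> dict:
    """Bottom-up column statistics of the sort a data profiler reports."""
    return {
        "missing_pct": series.isna().mean() * 100,              # missing values
        "cardinality": series.nunique(dropna=True),              # distinct values
        "top_values": series.value_counts().head(5).to_dict(),   # frequency analysis
        "min": series.min(),                                     # lowest value (outlier candidate)
        "max": series.max(),                                     # highest value (outlier candidate)
    }

# Example usage against a small, made-up data set.
df = pd.DataFrame({"state": ["NY", "NY", "IL", None, "ZZ"]})
print(profile_column(df["state"]))
```

The statistics are easy to produce; deciding which of them matter, and at what thresholds, is exactly the part the profiler cannot do for you.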
But just sitting down in front of a data profiler is essentially going on a fishing expedition, with no guarantee that you will catch anything tasty. The profiling process can lead to meandering through reports and statistics without really understanding what the tolerance levels for values are, or when those tolerances have been missed. So this brings us back to the first question: one begins to care about subjective data quality when the objective assessment loses focus.
For example, in one environment, the data analysts evaluated the nullness percentages of the columns. Certain columns rang up alarmingly null; but when this was brought to the attention of the customer, it was explained that the data model being used had been purchased from a third party, and the column in question was not necessary for their operations. The attribute was not present in the legacy system and the new application never used it, so its presence remained under the radar, with no one caring enough to take on the effort of actually modifying the data model. The upshot: from an objective standpoint, the absence of data in the column was bad; from the subjective standpoint, it was irrelevant.
That being said, the effort to profile, review, raise an issue about and then ignore the null attribute was unnecessary. That effort could have been avoided altogether had the analysts reviewed the model with the customer beforehand, singled out the data elements that were critical to the business process, and then focused their attention on profiling specific characteristics of those critical elements. In other words, combining the bottom-up aspects of data profiling with the top-down process described at the beginning of this discussion gives the assessment greater focus and optimizes the effort of finding the best opportunities for data quality improvement.
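A focused version of the earlier profiling sketch might look like the following. The critical-element list and the acceptability thresholds are hypothetical stand-ins for what would actually come out of the top-down review with the business data consumers:

```python
import pandas as pd

# Top-down inputs (hypothetical): the elements the business cares about,
# each with a maximum tolerable percentage of missing values.
CRITICAL_ELEMENTS = {"customer_id": 0.0, "zip": 1.0, "state": 1.0}

def assess_critical_elements(df: pd.DataFrame) -> list[str]:
    """Flag only the critical columns whose missing-value rate exceeds its tolerance."""
    findings = []
    for column, max_missing_pct in CRITICAL_ELEMENTS.items():
        missing_pct = df[column].isna().mean() * 100
        if missing_pct > max_missing_pct:
            findings.append(
                f"{column}: {missing_pct:.1f}% missing exceeds {max_missing_pct}% tolerance"
            )
    return findings

# Columns outside the critical list (like the unused third-party attribute)
# are never examined, so they never raise spurious issues.
df = pd.DataFrame({"customer_id": [1, 2, 3],
                   "zip": ["10001", None, "60601"],
                   "state": ["NY", "NY", "IL"],
                   "unused_attr": [None, None, None]})
print(assess_critical_elements(df))
```

Here the entirely empty "unused_attr" column never appears in the findings, because the top-down review already established that nobody downstream depends on it.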