Data Quality and Data Profiling: Objective versus Subjective Data Quality

by David Loshin

Originally published December 11, 2008

The conventional wisdom suggests that “quality data” is defined in terms of “fitness for use.” This somewhat odd phrase is heavy with context and suggests a significant amount of work to be done before one can pragmatically stamp the data with a quality seal of approval. To provide an assessment of the quality of a data set, one must understand who the consumers of that data are, how the data set is used, what it is used for, as well as the data consumers’ levels of expectation. This suggests a pattern for a top-down analysis phase for establishing criteria for data quality management:

  1. Identify the business users

  2. Determine what data sets are being used

  3. Document what business processes are supported by the data sets

  4. Evaluate the data elements that are most critical to satisfying the needs of the business processes

  5. Work with the business data consumers and other subject-matter experts to understand when data quality interferes with the successful operation of their business processes

  6. Qualify the expectations in terms of a means for measurement

  7. Establish processes for measurement and scoring

  8. Establish acceptability levels

This approach should provide a way of defining data quality criteria that are clearly specified in terms of the downstream business data consumers. By understanding the “use,” the data analyst has a framework for defining “fitness” in relation to the way the data is used and what makes the data suitable for those purposes (as characterized by the levels of acceptability). You might call this “subjective data quality,” since the definition of quality lies within the context of the business process, and data values that are good for one purpose might be less than acceptable for another.
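As a rough illustration of steps 6 through 8, the sketch below measures an expectation as the fraction of conforming records and compares it to an acceptability level. The `QualityRule` structure, the sample records and the thresholds are all hypothetical, chosen only to show how the same data can be acceptable for one consumer and unacceptable for another.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class QualityRule:
    """One measurable expectation plus its acceptability level (hypothetical structure)."""
    name: str
    check: Callable[[dict], bool]  # returns True when a record conforms
    acceptable: float              # minimum fraction of conforming records


def score_rules(records: List[dict], rules: List[QualityRule]) -> Dict[str, dict]:
    """Score each rule as the fraction of conforming records and compare to its threshold."""
    results = {}
    for rule in rules:
        passed = sum(1 for r in records if rule.check(r))
        score = passed / len(records) if records else 0.0
        results[rule.name] = {"score": score, "acceptable": score >= rule.acceptable}
    return results


# Illustrative data: the same records judged by two different consumers.
customers = [
    {"name": "A. Smith", "email": "a@example.com", "phone": None},
    {"name": "B. Jones", "email": None, "phone": "555-0100"},
    {"name": "C. Wu", "email": "c@example.com", "phone": "555-0101"},
    {"name": "D. Patel", "email": "d@example.com", "phone": None},
]

# Assumed acceptability levels; in practice these come from the business users.
email_campaign = [QualityRule("email_present", lambda r: bool(r["email"]), 0.70)]
call_center = [QualityRule("phone_present", lambda r: bool(r["phone"]), 0.90)]

campaign_result = score_rules(customers, email_campaign)
call_center_result = score_rules(customers, call_center)
```

With this sample, the data set meets the e-mail campaign's acceptability level but falls short of the call center's, which is the sense in which the quality judgment is subjective.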

On the other hand, one can consider that for certain kinds of data, the values are either of high quality or they are not. For example, a deliverable mail address either has the correct ZIP code or it doesn’t. We could state that a high quality address must have a correct city, state and ZIP code, no matter its use, so if the record does have the correct ZIP code for the named city and state, that record is of high quality. Actually, almost any definition and use of what is called reference data could be considered in terms of “objective data quality” – the data values could be considered to be of high quality outside of any specific business context. Lookup tables, codes and descriptions all fall into this category, especially when they are used across many different applications in the organization. Corroboration of data values in relation to reference data is particularly interesting since it assumes the objective accuracy of the data in the lookup tables.
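A minimal sketch of corroborating a record against reference data might look like the following. The tiny `ZIP_REFERENCE` table and the function name are assumptions made for illustration; a real lookup table would be far larger and sourced from an authoritative provider, and the check is only as good as that table.

```python
from typing import Dict, Tuple

# Hypothetical reference table: ZIP code -> (city, state).
ZIP_REFERENCE: Dict[str, Tuple[str, str]] = {
    "20500": ("WASHINGTON", "DC"),
    "10001": ("NEW YORK", "NY"),
}


def zip_matches_reference(record: dict,
                          reference: Dict[str, Tuple[str, str]] = ZIP_REFERENCE) -> bool:
    """Return True when the record's ZIP code agrees with its city and state.

    The result does not depend on how the record will be used, which is
    what makes this an "objective" check. It does assume the reference
    table itself is accurate.
    """
    expected = reference.get(record.get("zip") or "")
    if expected is None:
        return False  # ZIP code missing or not in the reference table
    city = (record.get("city") or "").strip().upper()
    state = (record.get("state") or "").strip().upper()
    return (city, state) == expected


good = zip_matches_reference({"city": "New York", "state": "ny", "zip": "10001"})
bad = zip_matches_reference({"city": "New York", "state": "NY", "zip": "20500"})
unknown = zip_matches_reference({"city": "Springfield", "state": "IL", "zip": "00000"})
```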

Here are some tricky questions: At what point does one care more about subjective data quality than objective data quality? When can we use technology to empirically assess the levels of data quality? Let’s put the first question aside and come back to it in a bit. As for the second, data profiling combines a set of techniques for analyzing data outside of any context, so for assessing the objective quality of the data, profiling is a good, albeit blunt, instrument. Most profilers, as bottom-up tools, essentially provide column value statistics and frequency analysis, which help highlight obvious potential issues such as duplication, unexpected value frequencies, distributions or cardinality, outlier values (highest or lowest), missing values and the like.
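To make the bottom-up view concrete, here is a small sketch of the kind of column statistics a profiler typically reports. It is not any particular product's algorithm; the function name, the output keys and the sample column are assumptions, and missing values are taken to be `None` or the empty string.

```python
from collections import Counter
from typing import Any, Dict, List


def profile_column(values: List[Any], top_n: int = 3) -> Dict[str, Any]:
    """Compute basic, context-free statistics for one column of values."""
    present = [v for v in values if v is not None and v != ""]
    freq = Counter(present)
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(freq),                      # cardinality
        "min": min(present) if present else None,   # lowest value
        "max": max(present) if present else None,   # highest value
        "most_common": freq.most_common(top_n),     # value frequencies
        "duplicated_values": sorted(v for v, n in freq.items() if n > 1),
    }


# Illustrative column of state codes with two missing entries.
states = ["MD", "MD", "VA", None, "DC", "", "MD", "VA"]
state_profile = profile_column(states)
```

Nothing in this output says whether two missing values out of eight is a problem; that judgment needs the tolerance levels that only the data consumers can supply.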

But just sitting down in front of a data profiler is essentially going on a fishing expedition, with no guarantee that you will catch anything tasty. The profiling process can lead to meandering through reports and statistics without really understanding what the tolerance levels for values are and when they have been exceeded. So this brings us back to the first question: one begins to care about subjective data quality when the objective assessment loses focus.

For example, in one environment, the data analysts evaluated the nullness percentages of the columns. Certain columns rang up alarmingly null. When one such column was brought to the attention of the customer, it was explained that the data model being used had been purchased from a third party and that the column was not necessary for their operations. The attribute was not present in the legacy system and the new application never used it, so its presence remained under the radar, with no one caring enough to make the effort to modify the data model. The upshot was that, from an objective standpoint, the absence of data in the column was bad, but from the subjective standpoint, it was irrelevant.

That being said, the effort to profile, review, raise an issue about and then ignore the null attribute was not necessary. In fact, that effort could have been avoided altogether had the analysts reviewed the model with the customer beforehand and singled out those data elements that were critical to the business process, and then focused their attention on profiling specific characteristics about those critical data elements. In other words, combining the bottom-up aspects of data profiling with the top-down process described at the beginning of this discussion provides a greater focus for the assessment and optimizes the effort for determining the best opportunities for data quality improvement.
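One way to picture this combination is to profile nullness only for the data elements the business has identified as critical, each with its own tolerance. The sketch below assumes hypothetical column names and tolerances; the unused `fax_number` column stands in for the vendor-supplied attribute in the example.

```python
from typing import Dict, Iterable, List


def null_rates(rows: List[dict], columns: Iterable[str]) -> Dict[str, float]:
    """Fraction of rows in which each named column is missing (None or empty string)."""
    total = len(rows)
    rates = {}
    for col in columns:
        missing = sum(1 for row in rows if row.get(col) in (None, ""))
        rates[col] = missing / total if total else 0.0
    return rates


def flag_critical_nulls(rows: List[dict],
                        critical_tolerances: Dict[str, float]) -> Dict[str, float]:
    """Profile only the critical columns and report those exceeding their tolerance."""
    rates = null_rates(rows, critical_tolerances)
    return {col: rate for col, rate in rates.items() if rate > critical_tolerances[col]}


# Illustrative rows; fax_number is always null but no business process uses it.
accounts = [
    {"account_id": "A1", "balance": 100.0, "fax_number": None},
    {"account_id": "A2", "balance": None, "fax_number": None},
    {"account_id": "A3", "balance": None, "fax_number": None},
    {"account_id": "A4", "balance": 50.0, "fax_number": None},
]

# Assumed result of the top-down review: critical columns and tolerated null rates.
critical = {"account_id": 0.0, "balance": 0.25}

issues = flag_critical_nulls(accounts, critical)
unfocused = null_rates(accounts, ["account_id", "balance", "fax_number"])
```

The unfocused pass reports `fax_number` as entirely null, while the focused pass never raises it and surfaces only the `balance` column, whose null rate exceeds what the business said it could tolerate.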

  • David Loshin
    David is the President of Knowledge Integrity, Inc., a consulting and development company focusing on customized information management solutions including information quality solutions consulting, information quality training and business rules solutions. Loshin is the author of Master Data Management, Enterprise Knowledge Management: The Data Quality Approach and Business Intelligence: The Savvy Manager's Guide and is a frequent speaker on maximizing the value of information. David can be reached at loshin@knowledge-integrity.com or at (301) 754-6350.

    Editor's note: More David Loshin articles, resources, news and events are available in the David Loshin Expert Channel on the BeyeNETWORK. Be sure to visit today!
