
Data Quality and Data Profiling: Objective versus Subjective Data Quality

Originally published December 11, 2008

The conventional wisdom suggests that “quality data” is defined in terms of “fitness for use.” This somewhat odd phrase is heavy with context and suggests a significant amount of work to be done before one can pragmatically stamp the data with a quality seal of approval. To provide an assessment of the quality of a data set, one must understand who the consumers of that data are, how the data set is used, what it is used for, as well as the data consumers’ levels of expectation. This suggests a pattern for a top-down analysis phase for establishing criteria for data quality management:

  1. Identify the business users

  2. Determine what data sets are being used

  3. Document what business processes are supported by the data sets

  4. Evaluate the data elements that are most critical to satisfying the needs of the business processes

  5. Work with the business data consumers and other subject-matter experts to understand when data quality interferes with the successful operation of their business processes

  6. Qualify the expectations in terms of a means for measurement

  7. Establish processes for measurement and scoring

  8. Establish acceptability levels
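The measurement steps (6 through 8) can be made concrete with a small sketch. This is purely illustrative: the completeness rule, the `completeness_score` function and the acceptability threshold are hypothetical stand-ins for whatever rules and levels the business data consumers actually agree on.

```python
# Hypothetical sketch of steps 6-8: qualify an expectation as a measurable
# rule (here, completeness of a field), score a data set against it, and
# compare the score to an agreed acceptability level.

def completeness_score(records, field):
    """Fraction of records with a non-empty value for `field`."""
    if not records:
        return 0.0
    populated = sum(1 for r in records if r.get(field) not in (None, ""))
    return populated / len(records)

ACCEPTABILITY = 0.95  # level agreed with the business data consumers (step 8)

records = [
    {"customer_id": 1, "zip": "10001"},
    {"customer_id": 2, "zip": ""},
    {"customer_id": 3, "zip": "60614"},
]

score = completeness_score(records, "zip")
print(f"zip completeness: {score:.2f}, acceptable: {score >= ACCEPTABILITY}")
```

In practice each critical data element from step 4 would carry its own rules and thresholds, but the pattern of measure, score, compare is the same.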

This approach should provide a way of defining data quality criteria that are clearly specified in terms of the downstream business data consumers. By understanding the “use,” the data analyst has a framework for defining “fitness” in relation to the way the data is used and what makes the data suitable for those purposes (as characterized by the levels of acceptability). You might call this “subjective data quality” since the definition of quality is within the context of the business process and data values that are good for one purpose might be less than acceptable for other purposes.

On the other hand, one can consider that for certain kinds of data, the values are either of high quality or they are not. For example, a deliverable mail address either has the correct ZIP code or it doesn’t. We could state that a high-quality address must have a correct city, state and ZIP code, no matter its use, so if the record does have the correct ZIP code for the named city and state, that record is of high quality. In fact, almost any definition and use of what is called reference data could be considered in terms of “objective data quality” – the data values could be considered to be of high quality outside of any specific business context. Lookup tables, codes and descriptions all fall into this category, especially when they are used across many different applications in the organization. Corroborating data values against reference data is particularly interesting, since it assumes the objective accuracy of the data in the lookup tables.
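As a minimal sketch of this kind of objective, context-free check, consider validating a record's city/state/ZIP combination against a lookup table. The tiny `ZIP_REFERENCE` table and the `zip_is_consistent` function are illustrative stand-ins for a real postal directory.

```python
# Objective validation against reference data: a record's city/state/ZIP
# combination either matches the lookup table or it doesn't, independent of
# how the record will be used downstream.

ZIP_REFERENCE = {
    "10001": ("NEW YORK", "NY"),
    "60614": ("CHICAGO", "IL"),
}

def zip_is_consistent(record):
    """True if the record's ZIP exists and matches its named city and state."""
    entry = ZIP_REFERENCE.get(record["zip"])
    if entry is None:
        return False
    city, state = entry
    return record["city"].upper() == city and record["state"].upper() == state

print(zip_is_consistent({"city": "Chicago", "state": "IL", "zip": "60614"}))  # True
print(zip_is_consistent({"city": "Chicago", "state": "IL", "zip": "10001"}))  # False
```

Note that the check is only as good as the reference table itself, which is exactly the assumption of objective accuracy mentioned above.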

Here are some tricky questions: At what point does one care more about subjective data quality than objective data quality? When can we use technology to empirically assess the levels of data quality? Let’s put the first question aside and come back to it in a bit. Of course, data profiling combines a set of techniques for analyzing data out of any context, so for assessing the objective quality of the data, profiling is a good, albeit blunt, instrument. This is because most profilers, as bottom-up tools, essentially provide column data value statistics and frequency analysis, which help highlight obvious potential issues such as duplication, unexpected value frequencies, distributions (or cardinality), outlier values (highest or lowest), missing values and the like.
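A bare-bones version of that bottom-up column analysis might look like the following; the `profile_column` function and the sample data are hypothetical, standing in for the kinds of statistics a commercial profiler reports.

```python
# A simple column profile of the kind described above: null counts, value
# frequencies, cardinality, and min/max outliers, computed with no business
# context at all.

from collections import Counter

def profile_column(values):
    """Return basic profiling statistics for a single column of values."""
    non_null = [v for v in values if v is not None]
    freq = Counter(non_null)
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "null_pct": (len(values) - len(non_null)) / len(values) if values else 0.0,
        "cardinality": len(freq),           # number of distinct non-null values
        "most_common": freq.most_common(3), # top value frequencies
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }

ages = [34, 41, 34, None, 29, 34, None, 57]
print(profile_column(ages))
```

The output tells you *what* the data looks like, but nothing about whether a 25% null rate or a lone outlier actually matters to anyone, which is precisely the limitation discussed next.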

But just sitting down in front of a data profiler is essentially going on a fishing expedition, with no guarantee that you will catch anything tasty. And the profiling process can lead to meandering through reports and statistics without really understanding what the tolerance levels for values are or when they have been breached. So this brings us back to the first question: one begins to care about subjective data quality when the objective assessment loses focus.

For example, in one environment, the data analysts evaluated the nullness percentages of the columns. Certain columns rang up alarmingly null; but when this was brought to the attention of the customer, it was explained that the data model being used had been purchased from a third party and that column was one that was not necessary for their operations. The attribute was not present in the legacy system and the new application never used it, so its presence remained under the radar with no one really caring about the effort to actually modify the data model. The upshot was, from an objective standpoint, the absence of data in the column was bad, but – from the subjective standpoint – irrelevant.

That being said, the effort to profile, review, raise an issue about and then ignore the null attribute was not necessary. In fact, that effort could have been avoided altogether had the analysts reviewed the model with the customer beforehand and singled out those data elements that were critical to the business process, and then focused their attention on profiling specific characteristics about those critical data elements. In other words, combining the bottom-up aspects of data profiling with the top-down process described at the beginning of this discussion provides a greater focus for the assessment and optimizes the effort for determining the best opportunities for data quality improvement.
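One way to sketch that combination is to drive the profiler from a list of business-critical elements and their agreed expectations, rather than profiling every column and then hunting for meaning. The element names, thresholds and table layout below are hypothetical.

```python
# Top-down meets bottom-up: profile only the data elements the business has
# identified as critical, and flag only those whose measured null rate
# exceeds the threshold agreed with the data consumers.

CRITICAL_ELEMENTS = {
    "customer_zip": {"max_null_pct": 0.02},
    "order_total":  {"max_null_pct": 0.00},
}

def assess(table, critical=CRITICAL_ELEMENTS):
    """Return (column, null_pct) pairs for critical columns that miss their threshold."""
    issues = []
    for column, expectation in critical.items():
        values = table.get(column, [])
        nulls = sum(1 for v in values if v is None)
        null_pct = nulls / len(values) if values else 1.0
        if null_pct > expectation["max_null_pct"]:
            issues.append((column, null_pct))
    return issues

table = {
    "customer_zip": ["10001", None, "60614", "94105"],
    "order_total": [19.99, 5.00, 12.50, 7.25],
    "legacy_flag": [None, None, None, None],  # not critical: never assessed
}
print(assess(table))
```

In this sketch, the entirely null `legacy_flag` column never raises an alarm, which is exactly the outcome the reviewed-with-the-customer-first approach would have produced in the example above.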
