
Data Quality: Anticipating Data Errors

Initial Thoughts

Originally published October 28, 2010

I was doing a series of training sessions on data quality at a rather large organization. During one of the breaks, one of the participants mentioned to me that the organization at some point had a data quality group, but that group had been disbanded a few years back. Apparently, some of the leaders (i.e., those who hold the purse strings) had decided that focusing on process improvement would obviate the need for data quality management. However, despite the process improvement activity, there were still known issues with the data, as well as criticism of the senior managers who had eliminated the data quality program.

Of course, this was a bit surprising to me for a few reasons. First, I would like to believe that many data quality issues are actually process issues, and therefore fixing the process and eliminating the root cause of data issues should result in improved data quality. But, on the other hand, does that completely eliminate the need for data quality management? And if not, then what parts of data quality management are still required?

Here is another scenario that raises the question in a slightly different way:

  • A data error has been identified
  • A data rule has been defined that is used for inspection and notification when the error occurs
  • That data rule has been introduced into the information production flow to alert a data steward when the error occurs
  • At the same time, an analyst has reviewed the information production flow and identified the source of the introduction of the problem
  • A solution for adjusting the information production process has been proposed and implemented, thereby eliminating the root cause of the data error
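The inspect-and-notify half of the scenario above can be sketched in a few lines of Python. This is only an illustrative sketch: the rule name, the record fields, and the notification callback are all assumptions for the example, not anything from the article.

```python
from typing import Callable

def make_rule(name: str, predicate: Callable[[dict], bool]):
    """Wrap a validity predicate as a named data rule."""
    def check(record: dict) -> bool:
        return predicate(record)
    check.rule_name = name
    return check

def inspect(records, rules, notify):
    """Run every rule over every record; notify the steward on each violation."""
    violations = []
    for record in records:
        for rule in rules:
            if not rule(record):
                violations.append((rule.rule_name, record))
                notify(rule.rule_name, record)
    return violations

# Usage sketch: one hypothetical rule, one good record, one bad record.
rules = [make_rule("order_qty_non_negative",
                   lambda r: r.get("order_qty", 0) >= 0)]
alerts = []
found = inspect([{"order_qty": 5}, {"order_qty": -2}], rules,
                lambda name, rec: alerts.append(name))
# The second record is flagged for the data steward.
```

Once the corresponding process fix is in place, this rule should (in principle) never fire again, which is exactly the tension the rest of the article explores.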

At this point, we have addressed the error in two ways: the first checks for the error when it occurs, while the second is intended to eliminate the possibility of the error occurring at all. If the process fix works, the error should not occur anymore, so why do I still need the inspection? In fact, if the process correction is right, that error should never appear again unless we change the process again or introduce different data that the existing process does not address.

Now repeat this scenario many times – you end up with improved processes and a bunch of data controls inspecting for errors that you should be confident you will never see again. So what value do we really get out of data quality inspection and monitoring? From this perspective, not much – reported violations only provide continuous reminders that the problem has not been addressed. Taking it to the next level, I could see the justification for eliminating certain data quality activities, as long as processes are really improved as a result of identifying data errors.

However, the perception of a continued need for a data quality program indicates that there is some flaw in our thinking about process improvement and data quality management, and it may center on the concept of validating data with respect to previously identified errors. The challenge is not monitoring for errors that you already know about – it is monitoring for errors that you don’t know about!

In addition to scanning for errors that already exist, we need to explore methods to scan for errors that have not yet occurred. But how is that done? The first step is to consider the types of errors that might occur and determine ways to anticipate them.

We already have a collection of typical dimensions of data quality such as accuracy, completeness, and consistency. As a simple example, let’s choose completeness as an area of anticipation. The traditional approach would profile the data and report which data elements are missing values. In turn, we can make a general presumption that all attribute values are required, and introduce monitors to alert a data steward any time any attribute’s value is missing. This is somewhat extreme, but it covers every completeness situation. Of course, the business users with subject-matter expertise for the data can probably tell the analyst when data attribute completeness is compulsory and when it is not, and that can reduce the potential overhead of a comprehensive set of completeness validations.
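The completeness example can be made concrete with a small Python sketch: profile missing values per attribute, then restrict the alerting to the attributes the subject-matter experts have marked as required. The field names, sample records, and the `required` set are assumptions for illustration only.

```python
def profile_completeness(records, attributes):
    """Return the count of missing (None or empty) values per attribute."""
    missing = {a: 0 for a in attributes}
    for rec in records:
        for a in attributes:
            if rec.get(a) in (None, ""):
                missing[a] += 1
    return missing

# Hypothetical sample data for the sketch.
records = [
    {"customer_id": "C1", "email": "a@example.com", "fax": ""},
    {"customer_id": "C2", "email": "", "fax": ""},
]
missing = profile_completeness(records, ["customer_id", "email", "fax"])
# missing == {"customer_id": 0, "email": 1, "fax": 2}

# The "extreme" presumption would alert on every missing value above;
# input from the business users narrows it to the mandatory attributes.
required = {"customer_id", "email"}
alerts = {a: n for a, n in missing.items() if a in required and n > 0}
# alerts == {"email": 1} -- fax is optional, so it is not monitored.
```

The design point mirrors the paragraph above: the blanket rule is cheap to state but noisy, and the business-supplied `required` set is what makes the monitoring tractable.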

I suspect that as a first cut, there are many common error scenarios that can be anticipated, many of which are more likely than others to occur. A systematic approach could examine the use of a critical data element (both in isolation and in the context of other data elements used within a business process) and consider the different ways the data can be flawed, how the flaw can be identified (even if it can’t be prevented), and what the mitigation/remediation processes are. As this could be a by-product of having a more comprehensive process for soliciting and documenting business user data requirements, it is worth considering what methods can be incorporated into that systematic approach to anticipating data errors.
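The systematic approach described above – cataloging, for each critical data element, the ways it can be flawed, how each flaw can be identified, and what the remediation process is – could be captured in a simple structure. The entries below are invented examples, not scenarios from the article.

```python
from dataclasses import dataclass

@dataclass
class ErrorScenario:
    element: str       # the critical data element
    flaw: str          # how the data can be flawed
    detection: str     # how the flaw can be identified
    remediation: str   # the mitigation/remediation process

# A hypothetical anticipated-error catalog.
catalog = [
    ErrorScenario("email", "missing value", "null/empty check",
                  "request from customer"),
    ErrorScenario("email", "malformed value", "pattern match",
                  "route to data steward"),
    ErrorScenario("order_qty", "negative quantity", "range check",
                  "investigate source system"),
]

def scenarios_for(element):
    """Look up the anticipated error scenarios for one data element."""
    return [s for s in catalog if s.element == element]
```

A catalog like this could be the by-product of the requirements-gathering process the paragraph mentions, with each documented business use of a data element generating one or more anticipated scenarios.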

Recent articles by David Loshin




Posted October 28, 2010 by Joe Celko jcelko212@earthlink.net

I am amazed at how few programmers today have any idea what a simple check digit is or how a regular expression can be used to validate data before it gets into the database. The math for a check digit is usually pretty simple (multiply, add, take a mod() and you are done), and you can Google a regular expression for trickier encodings like VINs, URLs, etc. and just cut & paste it into your code.
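To illustrate the two techniques the comment mentions, here is a sketch in Python. The Luhn mod-10 algorithm (used for credit-card numbers) is one common check-digit scheme; the regular expression is a deliberately simplified URL pattern, not a full RFC-compliant one.

```python
import re

def luhn_valid(number: str) -> bool:
    """Luhn mod-10 check-digit test: multiply, add, take a mod."""
    digits = [int(d) for d in number if d.isdigit()]
    total = 0
    # Double every second digit from the right; subtract 9 if it exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

# A simplified URL pattern -- good enough for basic input screening.
URL_RE = re.compile(r"^https?://[\w.-]+(/[\w./-]*)?$")

# "4539 1488 0343 6467" is a standard Luhn test number; flipping the
# final check digit makes it invalid.
```

Either check can run at the application boundary, so bad values are rejected before they ever reach the database.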

When I design (or re-design) a database, I make validation rules part of the specs and the data dictionary, even if it is just a simple "obvious" value range check, like CHECK(order_qty >= 0).



Copyright 2004 — 2020. Powell Media, LLC. All rights reserved.
BeyeNETWORK™ is a trademark of Powell Media, LLC