Data Quality: Anticipating Data Errors Initial Thoughts
by David Loshin
Originally published October 28, 2010
I was doing a series of training sessions on data quality at a rather large organization. During one of the breaks, one of the participants mentioned to me that the organization at some point had a data quality group, but that group had been disbanded a few years back. Apparently, some of the leaders (i.e., those who hold the purse strings) had decided that focusing on process improvement would obviate the need for data quality management. However, despite the process improvement activity, there were still known issues with the data, as well as criticism of the senior managers who had eliminated the data quality program.
Now repeat this scenario many times – you end up with improved processes and a bunch of data controls inspecting for errors that you should be confident you will never see again. So what value do we really get out of data quality inspection and monitoring? From this perspective, not much – reported violations only provide continuous reminders that the problem has not been addressed. Taking it to the next level, I could see the justification for eliminating certain data quality activities, as long as processes are really improved as a result of identifying data errors.
However, the perception of the continued need for a data quality program indicates that there is some flaw in our thinking about process improvement and data quality management, and it may center on the concept of validating data with respect to previously identified errors. The challenge is not monitoring for errors that you already know about – it is monitoring for errors that you don’t know about!
In addition to scanning for errors that already exist, we need to explore methods to scan for errors that have not yet occurred. But how is that done? The first step is to consider the types of errors that might occur and determine ways to anticipate those errors.
We already have a collection of typical dimensions of data quality such as accuracy, completeness, and consistency. As a simple example, let’s choose completeness as an area of anticipation. The traditional approach would profile the data and report which data elements are missing data. In turn, we can make a general presumption that all attribute values are required, and introduce monitors to alert a data steward any time any attribute’s value is missing. That is somewhat extreme, but it covers any completeness situation. Of course, the business users with subject-matter expertise for the data can probably tell the analyst when data attribute completeness is compulsory and when it is not, and that can reduce the potential overhead of a comprehensive set of completeness validations.
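A completeness monitor of the kind described here can be sketched in a few lines. This is a minimal illustration, not the author's implementation; the attribute names, the distinction between required and optional attributes, and the sample records are all hypothetical stand-ins for what business users with subject-matter expertise would actually specify:

```python
# Hypothetical completeness monitor: business users have identified which
# attributes are compulsory, so only those trigger an alert when missing.

REQUIRED = {"customer_id", "last_name"}   # completeness is compulsory (hypothetical)
OPTIONAL = {"middle_name", "fax_number"}  # completeness not required (hypothetical)

def completeness_violations(record: dict) -> list:
    """Return the names of required attributes that are missing or blank."""
    return sorted(attr for attr in REQUIRED
                  if record.get(attr) in (None, ""))

records = [
    {"customer_id": "C001", "last_name": "Smith", "middle_name": ""},
    {"customer_id": "", "last_name": "Jones"},    # missing customer_id
    {"customer_id": "C003", "middle_name": "Q"},  # missing last_name
]

for i, rec in enumerate(records):
    missing = completeness_violations(rec)
    if missing:
        # in practice this would alert a data steward, not just print
        print(f"record {i}: missing required attributes: {missing}")
```

Note that restricting the check to the `REQUIRED` set is exactly the overhead reduction the business users enable: a blank `middle_name` never raises an alert.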
I suspect that as a first cut, there are many common error scenarios that can be anticipated, many of which are more likely than others to occur. A systematic approach could examine the use of a critical data element (both in isolation and in the context of other data elements used within a business process) and consider the different ways the data can be flawed, how the flaw can be identified (even if it can’t be prevented), and what the mitigation/remediation processes are. As this could be a by-product of having a more comprehensive process for soliciting and documenting business user data requirements, it is worth considering what methods can be incorporated into that systematic approach to anticipating data errors.
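One way to document that systematic approach is a small registry that pairs each anticipated flaw in a critical data element with a detection check and a remediation note. The sketch below is an assumption about how such a registry might look – the element name (`email_address`), the rules, and the remediation text are all hypothetical examples, not taken from the article:

```python
# Hypothetical rule registry: for one critical data element, enumerate the
# ways the data can be flawed, how each flaw is identified, and what the
# documented remediation process is.

from dataclasses import dataclass
from typing import Callable

@dataclass
class AnticipatedError:
    description: str               # how the data can be flawed
    detect: Callable[[str], bool]  # returns True when the flaw is present
    remediation: str               # documented mitigation/remediation step

# anticipated errors for a hypothetical "email_address" element
EMAIL_RULES = [
    AnticipatedError("value missing",
                     lambda v: not v,
                     "route to steward for follow-up with the customer"),
    AnticipatedError("missing '@' separator",
                     lambda v: bool(v) and "@" not in v,
                     "flag for manual correction"),
]

def detected_flaws(value: str) -> list:
    """Apply every anticipated-error check and report the ones that fire."""
    return [rule.description for rule in EMAIL_RULES if rule.detect(value)]
```

Because each rule carries its remediation alongside its check, the registry doubles as the documented business-user data requirements the paragraph above calls for.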
Copyright 2004 — 2020. Powell Media, LLC. All rights reserved.
BeyeNETWORK™ is a trademark of Powell Media, LLC