Data Quality for Business Analytics
by David Loshin
Originally published September 29, 2011
Once the candidate data sets have been selected for the analysis, it is up to the data analysts to ensure that the quality of those data sets satisfies the needs of the consumers of the analytical results. I have written many articles about data quality, so rather than repeat the same content here, I will focus on four specific (and, incredibly, related!) issues:
The first issue involves establishing the difference between what is meant by “perfection” or “correctness” vs. defining and adhering to the levels of quality that are sufficient to meet the business needs. Some data quality analysts seem to suggest that data quality assessment should focus on characterizing the “correctness” or “incorrectness” of critical data attributes. The theory is that any obviously incorrect data attribute is probably a problem, and therefore counting the number (or percentage, if that is your preference) of records that are not “perfect” is the first step in identifying opportunities for data quality improvement.
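As a minimal sketch of that counting step, the tally might look like the following in Python. The records, field names, and validity rules are all hypothetical, chosen only for illustration; real rules would come from the business consumers of the data.

```python
import re

# Hypothetical records; field names and values are illustrative only.
records = [
    {"id": 1, "email": "ann@example.com", "zip": "20901"},
    {"id": 2, "email": "bob-at-example.com", "zip": "20901"},
    {"id": 3, "email": "cy@example.com", "zip": ""},
    {"id": 4, "email": "di@example.com", "zip": "1234"},
]

# Assumed validity rules, one per critical attribute.
rules = {
    "email": lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
    "zip": lambda v: re.fullmatch(r"\d{5}", v) is not None,
}


def failed_attributes(record):
    """Return the names of the attributes that break their rule."""
    return [name for name, is_valid in rules.items()
            if not is_valid(record.get(name, ""))]


def imperfect_share(rows):
    """Return (count, percentage) of rows with at least one failed attribute."""
    if not rows:
        return 0, 0.0
    flawed = sum(1 for row in rows if failed_attributes(row))
    return flawed, 100.0 * flawed / len(rows)


count, percent = imperfect_share(records)
print(f"{count} of {len(records)} records are not 'perfect' ({percent:.1f}%)")
```

A count like this says how many records are flawed, not whether the flaws matter to the business use.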
So let’s think about the scenarios where the distinction between “correct” and “sufficient” might make a difference. If we are doing operational reporting and need accurate statistics that will be compared to original source systems, then the data used for analysis must be highly consistent with the original source.
But if we are analyzing very large data sets for unusual patterns or to determine relationships, a small number of errors is not going to skew the results significantly. Take any of the large online retailers as an example: they seek to drive increased sales through relationship analysis and the appearance of correlated items within a market basket. If they are looking at millions of transactions a day, and a few show up as incorrect or incomplete, the errors are probably going to be irrelevant. This is our second issue, the basic question of precision: how precise do we need to be in demanding adherence to our target scores for our dimensions of data quality? It may be an objective question, but the measures used to answer it rely on subjective opinions.
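To put rough numbers on the market basket case, here is a small worked sketch in Python. The transaction counts are invented for illustration and do not come from any real retailer. It takes the worst case, in which every flawed transaction happened to contain the item pair being measured, and checks how far the pair's share of all transactions moves.

```python
def support(pair_count, total):
    """Share of transactions that contain both items of a pair."""
    return pair_count / total


# Assumed, illustrative volumes for one day of transactions.
total_transactions = 2_000_000
pair_transactions = 60_000   # baskets that contain both items
flawed_transactions = 200    # incorrect or incomplete, dropped from analysis

clean = support(pair_transactions, total_transactions)

# Worst case: every flawed transaction was one that contained the pair.
worst = support(pair_transactions - flawed_transactions,
                total_transactions - flawed_transactions)

relative_shift = abs(clean - worst) / clean
print(f"support without errors: {clean:.5f}")
print(f"support, worst case:    {worst:.5f}")
print(f"relative shift:         {relative_shift:.2%}")
```

Under these assumptions the measure moves by well under one percent, which is unlikely to change which items are judged to be correlated. The same errors in an operational report reconciled against a source system could still be unacceptable.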
Of course, the level of quality of data used for analysis is directly dependent on the quality of the original sources. As long as the data sets are produced or managed within the same organization, it is not unreasonable to impose an operational governance framework in which flawed data items are investigated to determine the root causes that might be eliminated. But what do you do if the data sets used for analysis originate outside your organization’s administrative domain? That introduces our third data quality issue: supplier management.
The “supplier management” concept for data quality basically states that as a “customer” of data, you have the right to expect that the data “supplier” will meet your level of expectation for a quality “product.” There is a general concept of this unstated contract when we buy stuff in stores – if the thing is no good, we can bring it back for a replacement or refund. If we were paying for the data, the same contract should be in place.
But in many cases, that is not the situation – we use someone else’s data, with no exchange of value, and therefore we are not in a position to demand that the supplier put forth effort to meet our needs if this goes beyond their needs. This is even more of an issue when using data taken from outside the organization, especially if the data is “out in the open,” such as government data published through data.gov, or even screen-scraped data sets resulting from web queries.
This leads to our fourth issue: if we have no leverage with which to induce the data producer to make changes to meet our analytical usage needs, we are left to our own devices to ensure the quality of the data. It is great to suggest that producer processes be evaluated and improved, but when push comes to shove, we cannot rely on the producer’s data quality. Therefore, if corrective action is needed to make the data usable, it is not unreasonable to presume that it will be done by the user.
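A minimal sketch of what user-side corrective action could look like, in Python, follows. The record layout, the `STATE_FIXES` lookup, and the rules for what counts as usable are all assumptions made for illustration; they are not taken from any particular data set or tool.

```python
# Assumed standardization table for one attribute of an external data set.
STATE_FIXES = {"MARYLAND": "MD", "VIRGINIA": "VA"}


def clean_record(raw):
    """Return a corrected copy of raw, or None when it cannot be made usable."""
    state = raw.get("state", "").strip().upper()
    state = STATE_FIXES.get(state, state)
    if len(state) != 2:
        return None
    try:
        # Strip thousands separators before converting to a number.
        amount = float(str(raw.get("amount", "")).replace(",", ""))
    except ValueError:
        return None
    return {"state": state, "amount": amount}


def split_usable(rows):
    """Separate rows into corrected records and rejects kept for review."""
    cleaned, rejected = [], []
    for raw in rows:
        fixed = clean_record(raw)
        if fixed is None:
            rejected.append(raw)
        else:
            cleaned.append(fixed)
    return cleaned, rejected


# Hypothetical rows as they might arrive from an outside supplier.
external_rows = [
    {"state": " md ", "amount": "1,200.50"},
    {"state": "Maryland", "amount": "300"},
    {"state": "VA", "amount": "n/a"},
    {"state": "", "amount": "10"},
]

cleaned, rejected = split_usable(external_rows)
print(len(cleaned), "usable;", len(rejected), "set aside for review")
```

Keeping the rejected rows, rather than silently discarding them, leaves evidence that can be shared with the supplier if an opportunity to engage them arises.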
In essence, our four issues provide some of the balance needed for managing the quality of data that originates outside the administrative domain. In other words, to ensure the data is usable, understand the needs of the users for reporting and analysis, engage the data suppliers if possible to contribute to improving the quality of provided data, and if that is not possible, then take matters into your own hands to ensure that the quality of the data is sufficient to meet your needs.