
Data Quality for Business Analytics

Originally published September 29, 2011

Once the candidate data sets have been selected for the analysis, it is incumbent upon the data analysts to ensure that the quality of those data sets satisfies the needs of the consumers of the analytical results. I have written many articles about data quality, so rather than repeat that content here, I will focus on four specific (and closely related) issues:

  1. Quality vs. correctness

  2. Precision of quality

  3. Supplier management (or the lack thereof)

  4. Data correction vs. process correction

The first issue involves establishing the difference between what is meant by “perfection” or “correctness” vs. defining and adhering to the levels of quality that are sufficient to meet the business needs. Some data quality analysts seem to suggest that data quality assessment should focus on characterizing the “correctness” or “incorrectness” of critical data attributes. The theory is that any obviously incorrect data attribute is probably a problem, and therefore counting the number (or percentage, if that is your preference) of records that are not “perfect” is the first step in identifying opportunities for data quality improvement.
The flaw in this line of thinking is the presumption of what is "obviously incorrect." In some cases, a value that is potentially incorrect might be truly inconsistent with expectations when taken out of context, yet might not have any negative business impact for any number of reasons. And when it comes to business analytics, sometimes the right result can be delivered even in the presence of "imperfect" data.
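
As a purely illustrative sketch, the record-counting style of assessment described above might look like this; the validation rules and field names are assumptions, not anything prescribed by a standard:

```python
# Hypothetical record-level quality check: a record counts as "imperfect"
# if it fails any of a set of illustrative validation rules.
def assess_quality(records, rules):
    """Return the fraction of records that pass every rule."""
    if not records:
        return 1.0
    passing = sum(1 for r in records if all(rule(r) for rule in rules))
    return passing / len(records)

# Illustrative rules for a customer record (field names are assumptions).
rules = [
    lambda r: bool(r.get("customer_id")),   # identifier must be present
    lambda r: "@" in r.get("email", ""),    # crude email format check
    lambda r: 0 < r.get("age", 0) < 130,    # plausible age range
]

records = [
    {"customer_id": "C1", "email": "a@example.com", "age": 34},
    {"customer_id": "C2", "email": "no-at-sign",    "age": 29},  # fails email rule
    {"customer_id": "",   "email": "b@example.com", "age": 51},  # fails id rule
    {"customer_id": "C4", "email": "c@example.com", "age": 45},
]

score = assess_quality(records, rules)  # 2 of 4 records pass -> 0.5
```

Note that a score like this only counts departures from "perfection"; it says nothing yet about whether those departures matter to the business.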
So this leads to the second issue: How good does the data need to be? The effectiveness of quality improvement, like almost any activity, reflects the Pareto principle: 80% of the benefit can be achieved by 20% of the effort and, consequently, the last 20% of the benefit requires 80% of the effort. If we need our quality rating to be 100%, eking our way from 80% to 100% will place a heavy demand on resources. But if a quality score of 80% is sufficient, that is a much smaller hill to climb.
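
The Pareto arithmetic above can be made concrete with a deliberately simple piecewise effort model; the model is an illustration of the 80/20 shape, not a claim about any particular project:

```python
# Piecewise model of the Pareto effect: the first 80% of the quality
# benefit costs 20% of the effort, and the last 20% of benefit costs
# the remaining 80% of the effort.
def relative_effort(target_quality):
    """Fraction of total effort needed to reach target_quality (0..1)."""
    if not 0.0 <= target_quality <= 1.0:
        raise ValueError("target_quality must be between 0 and 1")
    if target_quality <= 0.8:
        return (target_quality / 0.8) * 0.2
    return 0.2 + ((target_quality - 0.8) / 0.2) * 0.8

# Reaching an 80% quality score takes 20% of the effort budget,
# while pushing to 100% consumes all of it.
```

Under this model, moving the target from 80% to 90% quadruples the cumulative effort (from 0.2 to 0.6 of the budget), which is the "smaller hill" trade-off in numbers.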

So let’s think about the scenarios where this might make a difference. If we are doing operational reporting and need accurate statistics that will be compared to original source systems, then the quality of the data must reflect a high consistency between the data for analysis and the original source.

But if we are analyzing very large data sets for unusual patterns or to determine relationships, a small number of errors is not going to skew the results significantly. Take any of the large online retailers as examples – they seek to drive increased sales through relationship analysis and the appearance of correlated items within a market basket. If they are looking at millions of transactions a day, and a few show up as incorrect or incomplete, the errors are probably going to be irrelevant. This is the basic question of precision – how precise do we need to be in demanding adherence to our target scores for our dimensions of data quality? It may be an objective question but it relies on subjective opinions for the measures.
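A quick simulation illustrates the point; the 5% co-occurrence rate and the 0.1% error rate are assumptions chosen purely for illustration:

```python
# Sketch of why a tiny error rate barely moves a market-basket statistic.
# We estimate the support of an item pair {A, B} over a large set of
# transactions, then recompute after discarding a small fraction of
# records as "incorrect or incomplete."
import random

random.seed(7)
N = 1_000_000
# Synthetic transactions: the pair {A, B} truly co-occurs in ~5% of baskets.
together = [random.random() < 0.05 for _ in range(N)]

support = sum(together) / N

# Drop 0.1% of records at random, as if they had failed quality checks.
kept = [t for t in together if random.random() >= 0.001]
support_after = sum(kept) / len(kept)

# The two support estimates differ by far less than a percentage point,
# so the relationship analysis reaches the same conclusion either way.
```

The same reasoning does not hold for operational reporting, where a single dropped transaction can make a reconciliation fail; that is precisely the precision distinction drawn above.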

Of course, the level of quality of data used for analysis is directly dependent on the quality of the original sources. As long as the data sets are produced or managed within the same organization, it is not unreasonable to impose an operational governance framework in which flawed data items are investigated to determine the root causes that might be eliminated. But what do you do if the data sets used for analysis originate outside your organization’s administrative domain? That introduces our third data quality issue: supplier management.

The “supplier management” concept for data quality basically states that as a “customer” of data, you have the right to expect that the data “supplier” will meet your level of expectation for a quality “product.” There is a general concept of this unstated contract when we buy stuff in stores – if the thing is no good, we can bring it back for a replacement or refund. If we were paying for the data, the same contract should be in place.
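If such a contract were made explicit, a minimal acceptance check might look like the following sketch; the dimension names and thresholds are assumptions for illustration:

```python
# Hypothetical agreed thresholds for a supplier's data "product."
SLA = {"completeness": 0.98, "validity": 0.95, "timeliness": 0.90}

def accept_delivery(measured_scores, sla=SLA):
    """Accept a supplier's delivery only if every measured dimension
    meets or exceeds its agreed threshold; otherwise report the gaps."""
    gaps = {dim: (measured_scores.get(dim, 0.0), threshold)
            for dim, threshold in sla.items()
            if measured_scores.get(dim, 0.0) < threshold}
    return (len(gaps) == 0, gaps)

ok, gaps = accept_delivery({"completeness": 0.99,
                            "validity": 0.93,
                            "timeliness": 0.95})
# Rejected: validity (0.93) falls short of the agreed 0.95,
# which is the data equivalent of bringing the product back to the store.
```

The rejection report gives the "customer" the same recourse as a store receipt: a specific, measurable basis for asking the supplier for a fix.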

But in many cases, that is not the situation – we use someone else’s data, with no exchange of value, and therefore we are not in a position to demand that the supplier put forth effort to meet our needs if this goes beyond their needs. This is even more of an issue when using data taken from outside the organization, especially if the data is “out in the open,” such as government data published through data.gov, or even screen-scraped data sets resulting from web queries.

This leads to our fourth issue: if we have no leverage with which to induce the data producer to make changes to meet our analytical usage needs, we are left to our own devices to ensure the quality of the data. It is great to suggest that producer processes be evaluated and improved, but when push comes to shove, we cannot rely on the producer's data quality. Therefore, if corrective action is needed to make the data usable, it is not unreasonable to expect that the correction be performed by the user.
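
A minimal sketch of that user-side corrective action, assuming a simple repair-or-quarantine policy and hypothetical field names:

```python
# User-side correction of externally sourced data: records we can repair
# are standardized; the rest are quarantined rather than silently analyzed.
def correct_or_quarantine(records):
    """Return (cleaned, quarantined) partitions of the input records."""
    cleaned, quarantined = [], []
    for r in records:
        fixed = dict(r)
        # Repairable issue: inconsistent casing/whitespace in a code field.
        if isinstance(fixed.get("country"), str):
            fixed["country"] = fixed["country"].strip().upper()
        # Unrepairable issue: the field we analyze by is simply missing.
        if not fixed.get("country"):
            quarantined.append(r)
        else:
            cleaned.append(fixed)
    return cleaned, quarantined

cleaned, quarantined = correct_or_quarantine([
    {"country": " us "},
    {"country": "GB"},
    {"country": ""},
])
# -> cleaned country codes ["US", "GB"]; one empty record quarantined
```

Keeping a quarantine pile, rather than discarding bad records outright, preserves the evidence needed if a supplier conversation ever does become possible.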

In essence, our four issues provide the balance needed for managing the quality of data that originates outside the administrative domain. In other words, to ensure the data is usable, understand the needs of the users for reporting and analysis, engage the data suppliers if possible to contribute to improving the quality of the provided data, and if that is not possible, then take matters into your own hands to ensure that the quality of the data is sufficient to meet your needs.

Recent articles by David Loshin


