Blog: David Loshin

David Loshin

Welcome to my BeyeNETWORK Blog. This is going to be the place for us to exchange thoughts, ideas and opinions on all aspects of the information quality and data integration world. I intend this to be a forum for discussing changes in the industry, as well as how external forces influence the way we treat our information asset. The value of the blog will be greatly enhanced by your participation! I intend to introduce controversial topics here, and I fully expect that reader input will "spice it up." Here we will share ideas, vendor and client updates, problems, questions and, most importantly, your reactions. So keep coming back each week to see what is new on our Blog!

About the author

David is the President of Knowledge Integrity, Inc., a consulting and development company focusing on customized information management solutions including information quality solutions consulting, information quality training and business rules solutions. Loshin is the author of The Practitioner's Guide to Data Quality Improvement, Master Data Management, Enterprise Knowledge Management: The Data Quality Approach, and Business Intelligence: The Savvy Manager's Guide. He is a frequent speaker on maximizing the value of information. David can be reached at loshin@knowledge-integrity.com or at (301) 754-6350.

Editor's Note: More articles and resources are available in David's BeyeNETWORK Expert Channel. Be sure to visit today!

December 2008 Archives

It is currently the holiday break, which means two things: first, almost everybody is taking time off, and second, that gives us a little breathing room to sit and ponder issues pushed into the background during the rest of the year. One of those items has to do with data quality scorecards, data issue severity, and setting levels of acceptability for data quality scores.

Essentially, if you can determine some assertion that describes your expectation for quality within one of the commonly used dimensions, then you are also likely to be able to define a rule that can validate data against that assertion. A simple example: the last name field of a customer record may not be null. This assertion can be tested on a record-by-record basis, or I can even extract the entire set of violations from a database using a SQL query.

Either way, I can get a score, perhaps either a raw count of violations, or a ratio of violating records to the total number of records; there are certainly other approaches to formulating a "score," but this simple example is good enough for our question: how do you define a level of acceptability for this score?
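
To make that concrete, here is a minimal sketch in Python (the record layout and field names are hypothetical, not taken from any particular system): it applies the not-null rule record by record and rolls the violations up into both a raw count and a ratio.

```python
# Minimal sketch: test the "last name may not be null" rule and compute a score.
# Record layout and field names are hypothetical, for illustration only.

def last_name_present(record):
    """Validation rule: the last_name field must not be null or blank."""
    value = record.get("last_name")
    return value is not None and value.strip() != ""

def score_rule(records, rule):
    """Return the raw violation count and the ratio of violations to total records."""
    violations = [r for r in records if not rule(r)]
    total = len(records)
    ratio = len(violations) / total if total else 0.0
    return len(violations), ratio

# Example usage with a few hypothetical customer records
customers = [
    {"cust_id": 1, "last_name": "Smith"},
    {"cust_id": 2, "last_name": None},
    {"cust_id": 3, "last_name": ""},
]
count, ratio = score_rule(customers, last_name_present)
print(count, ratio)   # 2 violations, ratio of 0.667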

The approach I have been considering compares the relative financial impact associated with the occurrence of the error(s) against the various alternatives to address them. On one side of the spectrum, the data stewards can completely ignore the issue, allowing the organization to absorb the financial impacts. On the other side of the spectrum, the data stewards can invest in elaborate machinery to not only fix the current problem, but also ensure that it will never happen again. Other alternatives fall somewhere between these two ends, but where?

To answer this question, let's consider the economics. Ignoring the problem means that some financial impact will be incurred, but there is no cost of remediation. The other end of the spectrum may involve a significant investment, but may address issues that occur sporadically, if at all, so the remediation cost is high but the value may be low.

So let's consider one question and see if that helps. At some point, the costs associated with ignoring a recurring issue equal the cost of preventing the impact in the first place (either by monitoring for an error or preventing it altogether). We can define that as the tolerance point - any more instances of that issue make prevention worthwhile. And this establishes one level of acceptability - the maximum number of errors that can be ignored.

Calculating this point requires two data points: the business impact cost per error, and the cost of prevention. The rest is elementary arithmetic - multiply the per-error impact by the number of errors observed, subtract the prevention cost, and if you end up with a positive number, then it would have been worth preventing the errors.
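
As a rough sketch of that arithmetic, with dollar figures invented purely for illustration:

```python
# Rough sketch of the tolerance-point arithmetic; the dollar figures are
# invented purely for illustration.

impact_per_error = 25.00      # estimated business impact cost of a single error
prevention_cost = 5000.00     # cost of monitoring for / preventing the error

# Tolerance point: the number of errors at which the cumulative impact
# equals the cost of prevention.
tolerance_point = prevention_cost / impact_per_error   # 200 errors

def worth_preventing(observed_errors):
    """A positive result means prevention would have paid for itself."""
    return observed_errors * impact_per_error - prevention_cost

print(tolerance_point)          # 200.0
print(worth_preventing(150))    # -1250.0: cheaper to absorb the impact
print(worth_preventing(300))    #  2500.0: prevention would have been worthwhile
```

In this sketch the tolerance point falls at 200 errors: below that, absorbing the impact is cheaper; above it, prevention would have been the better investment.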

My next pondering: how can you model this? More to follow...


Posted December 30, 2008 11:45 AM

One good thing about being busy is that you get opportunities to streamline ideas through iteration. My interest in data profiling goes pretty far back, and the profiling process is one that is useful in a number of different usage scenarios. One of these is data quality assessment, especially in situations where not much is known about the data; profiling provides some insight into basic issues with the data.

But in situations where there is some business context regarding the data under consideration, undirected data profiling may not provide the level of focus that is needed. Providing reports on numerous nulls, outliers, duplicates, etc. may be overkill when the analyst already knows which data elements are relevant and which ones are not. In these kinds of situations, the analyst can instead concentrate on the statistical details associated with the critical data elements as a way to evaluate the extent to which data anomalies might impact the business.

So in some recent client interactions, instead of just throwing the data into the profiler and hoping that something good comes out, we narrowed the focus to just a handful of data elements and increased the scrutiny on the profiler results - sometimes refining the data sets, pulling different samples, segmenting the data to be profiled, or joining different data sets prior to profiling - all as a way to get more insight into the data instead of the typical reports telling me that yet another irrelevant data element is 99% null. The upshot is that a carefully planned, directed profiling process gave much more interesting results, both for us and for the client.
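
For what it's worth, here is one way such a directed pass might be scripted - a pandas-based sketch with hypothetical column names and an assumed segmenting field, not the actual client code. It profiles only the critical data elements, segment by segment, rather than dumping statistics for every column.

```python
# Sketch of "directed" profiling with pandas: profile only the critical data
# elements, optionally segment by segment. Column names and the segmenting
# field are hypothetical.
import pandas as pd

def directed_profile(df, critical_columns, segment_by=None):
    """Return basic profile statistics for the critical columns only."""
    groups = df.groupby(segment_by) if segment_by else [("all", df)]
    results = {}
    for segment, frame in groups:
        stats = {}
        for col in critical_columns:
            series = frame[col]
            stats[col] = {
                "null_pct": series.isna().mean(),
                "distinct": series.nunique(),
                "top_values": series.value_counts().head(5).to_dict(),
            }
        results[segment] = stats
    return results

# Example usage against a hypothetical customer extract
df = pd.read_csv("customer_extract.csv")
profile = directed_profile(df, ["last_name", "postal_code"], segment_by="region")
```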


Posted December 23, 2008 1:51 PM

