
Blog: David Loshin

David Loshin

Welcome to my BeyeNETWORK Blog. This is going to be the place for us to exchange thoughts, ideas and opinions on all aspects of the information quality and data integration world. I intend this to be a forum for discussing changes in the industry, as well as how external forces influence the way we treat our information asset. The value of the blog will be greatly enhanced by your participation! I intend to introduce controversial topics here, and I fully expect that reader input will "spice it up." Here we will share ideas, vendor and client updates, problems, questions and, most importantly, your reactions. So keep coming back each week to see what is new on our Blog!

About the author >

David is the President of Knowledge Integrity, Inc., a consulting and development company focusing on customized information management solutions, including information quality consulting, information quality training and business rules solutions. Loshin is the author of The Practitioner's Guide to Data Quality Improvement, Master Data Management, Enterprise Knowledge Management: The Data Quality Approach and Business Intelligence: The Savvy Manager's Guide. He is a frequent speaker on maximizing the value of information. David can be reached at loshin@knowledge-integrity.com or at (301) 754-6350.

Editor's Note: More articles and resources are available in David's BeyeNETWORK Expert Channel. Be sure to visit today!

As part of working with a client on some technical aspects of their master data management program, I recently participated in a review of some of the record matching and linkage strategies being applied to consolidate data from a collection of source data systems. While listening to the conversations during the meetings, it occurred to me that without a reasonable understanding of how record linkage works, it is difficult to assess the suitability of an algorithm, business rules, or blocking strategies associated with any of the major duplicate analysis, matching engine, or MDM tools.

I suggested to the client that it would be worthwhile to know as much about linkage as the vendors do, and recommended Herzog, Scheuren, and Winkler's recent book on record linkage, "Data Quality and Record Linkage Techniques." This book is a really good resource for understanding what record linkage is, how it works, and why it is important to a master data management activity, and the authors are well-known researchers in the area of record linkage. It is definitely worth reading; let me know what you think.

Posted August 27, 2009 7:30 AM
2 Comments



I agree. The need for data matching solutions is central to MDM and one of the primary reasons that companies invest in data quality tools.

The great news is that there are many data quality vendors to choose from and all of them offer viable data matching solutions driven by impressive technologies and proven methodologies.

The not so great news is that the wonderful world of data matching has a very weird way with words. Discussions about data matching techniques often include advanced mathematical terms like deterministic record linkage, probabilistic record linkage, Fellegi-Sunter algorithm, Bayesian statistics, conditional independence, and bipartite graphs.

Even with my industry experience, I still found the book that you recommended to be an excellent and invaluable resource.

Thanks and Best Regards...


Jim, on the contrary. There are indeed a number of vendors, but the basics of record linkage are fundamental to each implementation. Other than a few radical approaches, most seem to be improvements on earlier work, specifically that of Fellegi and Sunter, whose probability-based model has been widely adapted. The basics are described nicely by Winkler and Thibaudeau (www.census.gov/srd/papers/pdf/rr91-9.pdf), and you can probably "get it" if you take the time to read it carefully.
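To make the Fellegi-Sunter idea concrete, here is a minimal sketch in Python. The field names and the m- and u-probabilities are made up for illustration; in practice the probabilities are estimated from the data. Each field contributes a log-likelihood weight: a reward when it agrees, a penalty when it disagrees, and the sum is compared against thresholds to classify a pair as a match, a non-match, or a case for clerical review.

```python
from math import log2

# Hypothetical per-field probabilities (illustrative only):
#   m = P(field agrees | the two records are a true match)
#   u = P(field agrees | the two records are a non-match)
FIELDS = {
    "last_name":  {"m": 0.95, "u": 0.10},
    "birth_date": {"m": 0.98, "u": 0.01},
    "zip_code":   {"m": 0.90, "u": 0.05},
}

def match_score(rec_a, rec_b):
    """Sum per-field log2 likelihood-ratio weights, Fellegi-Sunter style."""
    score = 0.0
    for field, p in FIELDS.items():
        if rec_a.get(field) == rec_b.get(field):
            score += log2(p["m"] / p["u"])                # agreement weight
        else:
            score += log2((1 - p["m"]) / (1 - p["u"]))    # disagreement weight
    return score

a = {"last_name": "Smith", "birth_date": "1970-03-02", "zip_code": "20854"}
b = {"last_name": "Smith", "birth_date": "1970-03-02", "zip_code": "20878"}

# Two fields agree, one disagrees; a positive total suggests a likely match.
print(round(match_score(a, b), 2))
```

A real engine layers name standardization, approximate string comparators, and estimated (rather than assumed) probabilities on top of this skeleton, but the scoring core is essentially what the Winkler and Thibaudeau paper describes.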

Most of those are not really advanced mathematical terms, but descriptions of the approaches used, often adjustments to Fellegi and Sunter. For example, in one paper I read about 10 years ago, the authors explored approaches for setting the initial probabilities. A little bit of statistics will tell you that Bayesian algorithms are driven by conditional probabilities, which is still aligned with the fundamentals ("what are the chances of X given Y?").

When I sat with a representative from a vendor who was walking through matching strategies, I asked some straightforward questions: How do you do blocking? How does the variance in the attributes used for scoring adjust the weightings of the scores? What performance criteria are used to determine which attributes are used for blocking and which are used for matching? How are hard-coded exception rules used to modify scores? Not only did I not get a satisfactory answer to any of these questions, it became apparent that the vendor's representatives only understood how the product worked at a relatively high level.
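Blocking, the first of those questions, is easy to illustrate. Rather than scoring every pair of records (which grows quadratically), the engine groups records by a blocking key and only compares pairs within each group. This is a rough sketch with a made-up record set; a production tool would use multiple blocking passes with different keys.

```python
from collections import defaultdict
from itertools import combinations

def block_by(records, key):
    """Group records by a blocking key; only records sharing a block get compared."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[key]].append(rec)
    return blocks

def candidate_pairs(records, key):
    """Candidate pairs drawn only from within each block."""
    pairs = []
    for block in block_by(records, key).values():
        pairs.extend(combinations(block, 2))
    return pairs

records = [
    {"id": 1, "last_name": "SMITH", "phone": "555-0101"},
    {"id": 2, "last_name": "SMITH", "phone": "555-0101"},
    {"id": 3, "last_name": "JONES", "phone": "555-0199"},
    {"id": 4, "last_name": "SMITH", "phone": "555-0175"},
]

# Blocking on last_name yields 3 candidate pairs instead of all 6.
print(len(candidate_pairs(records, "last_name")))
```

The trade-off is exactly what my questions were probing: a key that is too coarse leaves too many pairs to score, while a key that is too fine lets true matches land in different blocks and never get compared at all.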

For example, in limited populations, the set of last names may be much less diverse than in the general public. If so, then the contribution of a last-name match to the overall score has to be downgraded in weight, since there is a higher probability of a false positive. In that case, perhaps last name should be used as a blocking key, and the match attributes should be phone number or SSN (or some other identifying attributes). It certainly is worth checking out, but that only makes sense because *I know how the algorithms work*. And that is not such a big deal, since I took the time to read a few papers, or a good book like the one I referred to in the blog entry, and that is why I suggested that my clients read those materials.
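The downgrading follows directly from the scoring model. The u-probability (the chance two non-matching records agree by accident) can be estimated from the observed name frequencies, and the agreement weight shrinks as that probability grows. A small sketch with invented name lists shows the effect:

```python
from collections import Counter
from math import log2

def u_probability(values):
    """Chance that two randomly drawn records agree on this field by accident."""
    counts = Counter(values)
    n = len(values)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))

def agreement_weight(m, u):
    """Fellegi-Sunter agreement weight for a field."""
    return log2(m / u)

# A diverse surname population: accidental agreement is rare...
diverse = ["Loshin", "Winkler", "Herzog", "Scheuren", "Thibaudeau", "Loshin"]
# ...versus a limited population dominated by a few surnames.
limited = ["Nguyen", "Nguyen", "Nguyen", "Tran", "Nguyen", "Tran"]

m = 0.95  # assume the same P(agree | true match) in both cases
w_diverse = agreement_weight(m, u_probability(diverse))
w_limited = agreement_weight(m, u_probability(limited))
print(w_diverse > w_limited)  # low-diversity last names carry less weight
```

That drop in weight is what tells you a last-name match is weak evidence in such a population, which is precisely the argument for demoting last name to a blocking key and matching on more discriminating attributes.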

As with any valuable endeavor, if you want to do a good job, you take the time to accumulate the knowledge you need to do it well. Instead of presuming that these phrases are "advanced mathematical terms," I'd encourage people to check out what is really going on under the hood...
