Blog: David Loshin

David Loshin

Welcome to my BeyeNETWORK Blog. This is going to be the place for us to exchange thoughts, ideas and opinions on all aspects of the information quality and data integration world. I intend this to be a forum for discussing changes in the industry, as well as how external forces influence the way we treat our information asset. The value of the blog will be greatly enhanced by your participation! I intend to introduce controversial topics here, and I fully expect that reader input will "spice it up." Here we will share ideas, vendor and client updates, problems, questions and, most importantly, your reactions. So keep coming back each week to see what is new on our Blog!

About the author

David is the President of Knowledge Integrity, Inc., a consulting and development company focusing on customized information management solutions, including information quality consulting, information quality training and business rules solutions. Loshin is the author of The Practitioner's Guide to Data Quality Improvement, Master Data Management, Enterprise Knowledge Management: The Data Quality Approach and Business Intelligence: The Savvy Manager's Guide. He is a frequent speaker on maximizing the value of information. David can be reached at loshin@knowledge-integrity.com or at (301) 754-6350.

Editor's Note: More articles and resources are available in David's BeyeNETWORK Expert Channel. Be sure to visit today!

August 2005 Archives

Believe it or not, even a savvy person like me is subject to fraud once in a while. Perusing our most recent credit card bill, I came across three charges that were clearly not ours - small charges made at gas stations in another state. When we called the credit card company, they determined that our card had been duplicated, since the charges had been swiped through a card reader. Apparently, at some point recently our credit card data must have been double-swiped through a magnetic card reader and then transferred to a duplicated card.

The duplicated card was then used for small-ticket purchases at innocuous locations, intended to evade the bank's fraud detection algorithms. The pattern is that the fraudsters pilot a couple of small charges, and if the account holder doesn't shut off the card, much larger charges follow.

Once the bank was made aware of the situation, they immediately cancelled the card and reversed the charges. When asked whether they would investigate the fraud, the customer service representative (CSR) said that they don't bother with these kinds of small amounts; they just write them off.

Ever wonder how much money is lost due to small-scale fraud? The CSR told us that $20,000,000.00 is written off each quarter! I think, though, that it would be possible to use BI techniques to track down this illegal behavior...

The kinds of charges that appeared were interesting: $33.10, $45.00, and $70.00. Of the three charges, only one was not a round-dollar amount, and the second and third were made at the same location at almost the same time.

Back in April, I wrote an article for B-EYE-Network on the use of Benford's Law for Information Analysis. In that article, I described a digital analysis phenomenon regarding the distribution of numeric digits in large data sets. The truth is, Benford analysis has been used primarily as an auditing technique to look for fraudulent behavior, and I am confident that (with a little thought) a reasonable use of the technique could help in identifying transaction patterns that span different duplicated credit cards.

Individuals are likely to repeat their bad behavior. Even if they think they are creating random sequences of dollar amounts, each sequence may reflect a particular signature that identifies the perpetrator, and analyzing the geographic density of the illicit charge locations could pinpoint a reasonable place to start tracking down the offenders.
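To make the idea a bit more concrete, here is a minimal sketch of a Benford-style first-digit check over a batch of charge amounts. This is my own illustration, not anything from the bank or from the original article; the amounts and class names are hypothetical. Benford's Law says the leading digit d should appear with frequency log10(1 + 1/d), so a batch whose observed digit distribution strays far from that curve is a candidate for closer review.

```java
/**
 * A minimal sketch of a Benford first-digit analysis over transaction
 * amounts. Not production fraud detection -- just an illustration of
 * comparing observed leading-digit frequencies to Benford's expected
 * distribution, P(d) = log10(1 + 1/d).
 */
public class BenfordCheck {

    public static void main(String[] args) {
        // Hypothetical batch of charge amounts drawn from one card or merchant.
        double[] amounts = {33.10, 45.00, 70.00, 41.50, 47.25, 38.00, 72.10, 44.00};

        long[] counts = new long[10];          // counts[1..9] = leading-digit tallies
        for (double amount : amounts) {
            int d = leadingDigit(amount);
            if (d > 0) {
                counts[d]++;
            }
        }

        // Compare the observed frequency to the Benford expectation for each digit.
        for (int d = 1; d <= 9; d++) {
            double expected = Math.log10(1.0 + 1.0 / d);
            double observed = (double) counts[d] / amounts.length;
            System.out.printf("digit %d: observed %.3f, expected %.3f%n",
                    d, observed, expected);
        }
    }

    /** Returns the first significant digit of a positive amount, or 0 for zero. */
    static int leadingDigit(double amount) {
        double a = Math.abs(amount);
        if (a == 0.0) {
            return 0;
        }
        while (a >= 10.0) {
            a /= 10.0;
        }
        while (a < 1.0) {
            a *= 10.0;
        }
        return (int) a;
    }
}
```

In practice you would run this over much larger batches than eight charges, of course, and use a proper goodness-of-fit measure rather than eyeballing the output.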

Anyone out there with experience in fraud detection using Benford's Law? What do you guys think?


Posted August 26, 2005 2:29 PM
Permalink | 30 Comments |

Is it better to clean data on intake or after it has been processed?

Let's say you have a data entry process in which names and addresses are input into a system. At some point in your processing, that same data (name and address) will be forwarded to an application performing a business process, such as printing a shipping label. However, it is not guaranteed that the individual whose name and address were entered will ever be sent anything.

You want to maintain clean data, and you are now faced with two options: cleanse the data at intake or cleanse it when it is used. There are arguments for each approach...

On the one hand, a number of data quality experts advocate ensuring that the data is clean when it enters your system, which would support the decision to cleanse the data at the intake location.

On the other hand, since not all names and addresses input to the system are used, cleansing them may turn out to be additional work that was not necessary. Instead, cleansing on use would limit your work to what is needed by the business process.

Here is a hybrid idea: cleanse the data to determine its standard form, but don't actually modify the input data. The reason is that a variation in a name or address provides extra knowledge about the individual - perhaps a nickname, or a spelling variation that may recur in other situations. Reducing each occurrence of a variation to a single form removes knowledge of potential aliases, which ultimately reduces your global knowledge of the individual. But if you can determine that the input data is just a variation of one (or more) records you already know, storing the entered version linked to its cleansed form provides greater knowledge moving forward.
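As a rough illustration of that hybrid idea (my own sketch; the class and field names are invented for the example), the value is kept exactly as entered and linked to the standardized form it resolves to, so the alias survives while downstream processes use the cleansed version:

```java
/**
 * A rough sketch of the hybrid approach: keep the value exactly as entered,
 * but link it to a standardized (cleansed) form. The class and field names
 * here are illustrative, not part of any particular product or standard.
 */
public class NameAndAddressRecord {

    // The name and address exactly as the data entry operator typed them.
    private final String enteredName;
    private final String enteredAddress;

    // The cleansed, standardized form this entry resolves to (shared by
    // all variations determined to refer to the same individual).
    private final StandardizedParty standardForm;

    public NameAndAddressRecord(String enteredName, String enteredAddress,
                                StandardizedParty standardForm) {
        this.enteredName = enteredName;
        this.enteredAddress = enteredAddress;
        this.standardForm = standardForm;
    }

    /** The business process (e.g., label printing) uses the cleansed form. */
    public StandardizedParty getStandardForm() {
        return standardForm;
    }

    /** The original variation is retained as a potential alias or nickname. */
    public String getEnteredName() {
        return enteredName;
    }

    public String getEnteredAddress() {
        return enteredAddress;
    }
}

/** The single standardized representation of an individual. */
class StandardizedParty {
    final String standardName;
    final String standardAddress;

    StandardizedParty(String standardName, String standardAddress) {
        this.standardName = standardName;
        this.standardAddress = standardAddress;
    }
}
```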


Posted August 23, 2005 6:22 AM
Permalink | No Comments |

I am confident that, when properly distilled out of the masses of data, the essence of knowledge lies embedded within an organization's metadata. In fact, conversations with clients often center on different aspects of what we call metadata, often buried within topics like "data dictionary," "data standards" or "XML" - yet the meaning of corporate knowledge always boils down to its metadata.

I have recently been involved in advising the formation of the Meta-Data Professional Organization (MPO), a new professional organization focused on establishing a community of practice for metadata practitioners.

The intention of the MPO is to be a primary resource for exchanging ideas and advice on best practices in metadata management and implementation. I hope that this organization will be the kind of group in which individuals share their knowledge and experience in a way that benefits others, especially on some of the more challenging aspects of metadata: clearly articulating the business benefits of a metadata management program, assembling a believable business case, and developing a project plan for assessment and implementation.

If you check out the board members, you will probably see some names familiar to you from other venues, such as TDWI or DAMA.

If you have any interest in metadata, it would be worthwhile to consider how you could contribute to this organization!


Posted August 23, 2005 6:04 AM
Permalink | 1 Comment |

Do the structures described within XML schemas correspond to classes and objects described in Java or C++, or to entity-relationship models? There seems to be a bit of a debate on the topic. On one hand, there does seem to be a correlation, which makes it possible to automate the generation of Java classes that mimic XML schemas (see the Sun Java XML Binding Compiler for details).
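As a simple illustration of that correspondence, here is a hand-written sketch; the schema fragment and the Java class are my own invented example, not actual output from the binding compiler:

```java
/**
 * A hand-written sketch of the schema-to-class correspondence; the names
 * are illustrative and this is not actual binding-compiler output.
 *
 * Given a schema complexType such as:
 *
 *   <xs:complexType name="Customer">
 *     <xs:sequence>
 *       <xs:element name="name"    type="xs:string"/>
 *       <xs:element name="address" type="xs:string"/>
 *     </xs:sequence>
 *     <xs:attribute name="id" type="xs:int"/>
 *   </xs:complexType>
 *
 * a natural Java counterpart is simply:
 */
public class Customer {
    private int id;          // maps to the "id" attribute
    private String name;     // maps to the <name> element
    private String address;  // maps to the <address> element

    public int getId() { return id; }
    public void setId(int id) { this.id = id; }

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }

    public String getAddress() { return address; }
    public void setAddress(String address) { this.address = address; }
}
```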

On the other hand, the flexibility in defining schemas allows a clever (albeit devious) practitioner to define structures that would challenge any object-oriented programmer.

I am currently looking at a project in which we are reviewing how XML schemas can be defined in a way that eases the design of the software that supports them. I have some definite ideas about this, but I'm interested in hearing some ideas from you readers. I will follow up on this topic, perhaps in an upcoming article.


Posted August 16, 2005 9:07 PM
Permalink | No Comments |

We hear a lot about open source software and its potential benefits to the marketplace. How about the concept of open source data? The idea is to create a repository of data that is readily available, can be configured for business benefit, and is collectively supported by a development community.

One place to start is with public data, such as what is available from the US Census Bureau.

Every 10 years, the US Census Bureau conducts a census and, as part of that process, collects a huge amount of demographic data at a very granular geographic level. The Bureau then spends the next 5+ years analyzing the data and preparing it for release, while at the same time preparing for the next decennial census.

The problem is that sometimes, by the time the decennial data is released, it no longer accurately reflects an area's demographics. For example, consider how rapidly real estate prices have risen in the past five years - yet 2005 home prices are not captured in Census 2000 data. Similarly, the Tiger/Line data that contains information about street addresses is only occasionally updated, while new streets and subdivisions are constantly being built, so it is likely that there are omissions in the Census data set.

There are many other public domain, public records, or generally available data sets that are of great interest to the BI community. So here is the challenge: tell me how you feel about a project that takes a publicly available data set and creates an "open source" approach to maintaining and presenting that data. One example might be taking the Census decennial data and formulating it into a relational data structure mapped across the geographic Tiger/Line data (a rough sketch of what I mean appears below). Pose your ideas as comments to this entry...
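To give one speculative example of what such a mapping could look like (the field names here are illustrative placeholders, not the actual Census or Tiger/Line column names), the decennial summary data and the Tiger/Line street segments might be joined on a shared geographic identifier:

```java
/**
 * A speculative sketch of joining decennial summary data to Tiger/Line
 * street segments by a shared geographic identifier (such as a census
 * tract or block code). All names here are placeholders for illustration.
 */
public class OpenCensusSketch {

    /** One row of decennial demographic data for a geographic unit. */
    static class CensusBlockSummary {
        String geoCode;        // state + county + tract + block identifier
        long population;
        long housingUnits;
    }

    /** One street segment from the Tiger/Line geographic data. */
    static class StreetSegment {
        String geoCode;        // the geographic unit the segment falls in
        String streetName;
        int fromAddress;
        int toAddress;
    }

    /** The relational join key is the shared geographic code. */
    static boolean sameGeography(CensusBlockSummary c, StreetSegment s) {
        return c.geoCode.equals(s.geoCode);
    }
}
```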


Posted August 14, 2005 1:21 PM
Permalink | No Comments |