Blog: David Loshin

David Loshin

Welcome to my BeyeNETWORK Blog. This is going to be the place for us to exchange thoughts, ideas and opinions on all aspects of the information quality and data integration world. I intend this to be a forum for discussing changes in the industry, as well as how external forces influence the way we treat our information asset. The value of the blog will be greatly enhanced by your participation! I intend to introduce controversial topics here, and I fully expect that reader input will "spice it up." Here we will share ideas, vendor and client updates, problems, questions and, most importantly, your reactions. So keep coming back each week to see what is new on our Blog!

About the author

David is the President of Knowledge Integrity, Inc., a consulting and development company focusing on customized information management solutions including information quality solutions consulting, information quality training and business rules solutions. Loshin is the author of The Practitioner's Guide to Data Quality Improvement, Master Data Management, Enterprise Knowledge Management: The Data Quality Approach and Business Intelligence: The Savvy Manager's Guide. He is a frequent speaker on maximizing the value of information. David can be reached at (301) 754-6350.

Editor's Note: More articles and resources are available in David's BeyeNETWORK Expert Channel. Be sure to visit today!

Recently in Challenge Category

On Monday, March 15, I conducted a full-day master data management tutorial at the Enterprise Data World conference. As a forum for discussing pragmatic MDM best practices, one section of the day was set aside for a panel discussion among representatives of four vendors:

  • Dan Soceanu from DataFlux
  • Marty Moseley from Initiate Systems - An IBM Company
  • Ravi Shankar from Informatica (formerly Siperian)
  • Jim Walker from Talend

I posed the question to all four: What defines data as "master data"? The first round of answers focused on the standard answer: data concepts that "are important" to the business and are shared by two or more applications. My reaction was that this was not a practical guide, so I rephrased the question: What can the people in the audience do when they get back from the conference to start identifying data entities as master data?

Again, I did not get the answers I was looking for - all four suggested that the task was not one that could be done at your desk, that it required knowledge of the business, that subject matter experts had to be consulted.

All true, but again, not executable, so I reframed the question again: knowing that there is bound to be variation, replication, duplication, redundancy, and differences in semantics, what is a process for reviewing the data to decide which data elements of which data entities belong in a unified master view?

At that point the answer became a little clearer: you can't tell unless you understand what each data element is, how it is used, what its definition is, how many applications use it, and in what types of usage scenarios. In addition, you need oversight of the process for analyzing the data, capturing and sharing the results, and having all that information validated by subject matter experts.
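One small piece of this can be sketched in code: taking an inventory of which applications use which data elements and flagging those shared by two or more applications. This is only a rough illustration of the "shared by two or more applications" criterion mentioned earlier, not a method any of the panelists proposed. The application names, element names, and the `min_apps` threshold are all hypothetical, and the output is just a candidate list that subject matter experts would still need to validate.

```python
from collections import defaultdict

# Hypothetical inventory: application -> data elements it uses ("entity.element").
usage = {
    "billing": {"customer.name", "customer.address", "invoice.amount"},
    "crm": {"customer.name", "customer.address", "lead.score"},
    "shipping": {"customer.address", "shipment.weight"},
}

def candidate_master_elements(usage, min_apps=2):
    """Return elements used by at least min_apps applications.

    The result is a starting list for review, not a final determination.
    """
    apps_by_element = defaultdict(set)
    for app, elements in usage.items():
        for element in elements:
            apps_by_element[element].add(app)
    return {
        element: sorted(apps)
        for element, apps in apps_by_element.items()
        if len(apps) >= min_apps
    }

candidates = candidate_master_elements(usage)
for element in sorted(candidates):
    print(element, "used by", ", ".join(candidates[element]))
```

A count like this says nothing about semantics: two applications may both hold a "customer.name" that means different things, which is where the definitions and the expert review come in.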

As moderator, I responded by summarizing: "in order to determine what data is master data, you need to analyze the data, document all the information about the data, and have policies for overseeing that process. That sounds like data profiling, metadata management, and data governance." (nods all around)

But it has to be more than that; there has to be a more operationalized method that results in a clear determination of which data elements of which data entities are to be mastered.

Posted March 17, 2010 1:01 PM

I am currently at the DataFlux IDEAS conference and have sat through two sessions in which the speakers are discussing how the amount of data is exploding, with the implication that we need more effective data governance to manage this flood. While I cannot disagree with the sentiment, I'd have to suggest that buried within both speakers' messages lies the challenge:


Instead of attempting to address new data challenges with traditional approaches, we need to reinvent data governance in the context of the changing ways that people are using that data.

The more data there is, the more difficult it is to filter out the signal from the noise. How does one distinguish between the requirements for overseeing signal and those for noise? More to follow...

Posted October 6, 2009 8:02 AM

An interesting article about people leaving Facebook caught my eye because it resonated with some of the same issues I have had with it - inspired nosiness, misrepresentations of the concept of a friend (vs. connection), the way some people become obsessed and absorbed into it, and other observations.

After I had signed up (prodded by an old friend with whom I had fallen out of touch), I started to see others from my (growingly hazy view of the) past contact me asking to be connected. I guess I just said yes, and ended up with some connections, which led to other requests, etc.

So Facebook is a little different from my other social network, which is valuable to me as a business tool. Facebook does not provide that value, although it is interesting to see what people I used to know a long time ago are doing (hmm, a little nosy there, eh?).

The problem is that there are reasons that I stopped being in touch with a lot of former acquaintances, and getting back in touch with people that I no longer have much in common with is interesting at first but benign moving forward. And despite the few situations in which I am connected with someone I regret losing touch with, it makes me have to actively ignore people that I have been able to passively ignore for a good twenty years or so.

On the other hand, there are some folks (like my friend Jeremy Epstein) who are building careers out of exploiting social marketing, and from an information perspective, there seems to be a lot of opportunity (check out Stephen Baker's book Numerati for some good examples as well).

I am interested - what is your experience with Facebook - as a connectivity tool, as a business tool, as an entertainment forum? Post your comments!

Posted September 3, 2009 10:03 AM

I was just scanning Philip Russom's October 2007 monograph on "Unifying the Practices of Data Profiling, Integration, and Quality," and noticed this:

"The quality of data degrades as application update, add data to, or delete data from a database. Most estimates say that 10-12% of the data in an active database becomes dirty, nonstandard, or redundant each month. Hence, if you cleanse a database 100% today, it will only be 88-90% clean 30 days from now."

I love finding sentences like this, in which some (hopefully objective) third party provides a hard statistic that can be used as ammunition for supporting a data quality initiative. On the other hand, I do get concerned when an unnamed source (the "most estimates" part, I mean) is used to provide the statistic.

Actually, what would happen if the data is never cleansed? How about for six months? Does the 10-12% degradation apply to all the records or only to currently clean ones? OK, simple arithmetic - if each month 10% of the clean records become dirty, nonstandard, or redundant, then after 6 months (.9*.9*.9*.9*.9*.9) * 100%, or 53.1%, of the records are unsullied.
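The two readings of the statistic can be compared with a few lines of code. This is a minimal sketch: the function names are mine, and both models are simplifying assumptions (a constant monthly rate, with no cleansing and no new records arriving).

```python
def clean_fraction_compounding(monthly_rate: float, months: int) -> float:
    """Each month, monthly_rate of the *currently clean* records go bad."""
    return (1.0 - monthly_rate) ** months

def clean_fraction_flat(monthly_rate: float, months: int) -> float:
    """Each month, monthly_rate of *all* records go bad (never below zero)."""
    return max(0.0, 1.0 - monthly_rate * months)

for rate in (0.10, 0.12):
    compounding = clean_fraction_compounding(rate, 6)
    flat = clean_fraction_flat(rate, 6)
    print(f"rate {rate:.0%}: compounding {compounding:.1%}, flat {flat:.1%}")
```

At 10% per month the compounding reading gives the 53.1% figure above, while the flat reading leaves only 40% of the records clean after six months, so the answer to "which records does the rate apply to" matters quite a bit.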

I am not sure that I believe this to be the case. Perhaps the rate at which data becomes dirty slows each month, since those records with a higher propensity to become flawed (for whatever reason - multiple touchpoints, commonly-used records, etc.) will have already been subjected to an error, so they would not be counted the second month?
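To see how that might play out, here is a toy model with two groups of records: a small error-prone group and a large stable one. The group shares and monthly error probabilities are invented purely so that the first month shows 10% degradation; they are not drawn from any study.

```python
def clean_by_month(groups, months):
    """Fraction of records still clean at the end of each month.

    groups: list of (share_of_records, monthly_error_probability) pairs,
    with the shares summing to 1. Index 0 of the result is the starting point.
    """
    return [
        sum(share * (1.0 - p) ** m for share, p in groups)
        for m in range(months + 1)
    ]

# Assumed mix: 25% of records are error-prone, 75% are fairly stable.
groups = [(0.25, 0.34), (0.75, 0.02)]
clean = clean_by_month(groups, 6)
newly_dirty = [clean[m] - clean[m + 1] for m in range(6)]

for month, (still_clean, lost) in enumerate(zip(clean[1:], newly_dirty), start=1):
    print(f"month {month}: {lost:.1%} newly dirty, {still_clean:.1%} still clean")
```

Under these made-up numbers the first month loses 10% of the records, but each later month loses fewer, and more records are still clean after six months than the 53.1% that a constant 10% rate would predict.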

If anyone has any references to this 10-12% number, post a comment with a link!

Posted July 15, 2009 1:06 PM

There is an oft-quoted statistic about the growth rate of data volumes that I wanted to use in some context, and I started searching for a source. I googled "data volumes" +"double every" to see what I could find, and to my surprise, lots of hits, but it is difficult to pin down the exact parameters. Lots of folks are using the statistic:

"Data doubles every year"
"The amount of stored data from corporations nearly doubles every year"
"...the amount of data stored by businesses doubles every year to 18 months."
"In his book “Simplicity,” business management expert and author Bill Jensen indicates that the most conservative estimates show business information doubling every three years, while some estimates say data doubles every year. "
"Unstructured data doubles every three months"

I am still following links from the first page of results, and we are doubling our data every 3 to 18 months.

"Reed's Law states that the volume of data doubles every 12 months. "

OK, so there is actually a law about it. Hold on a second, according to Wikipedia this law is about the utility of (social) networks, so perhaps the law doesn't apply in all jurisdictions.

Anyway, these may all be references to a UC Berkeley study on the growth of data, which said that the amount of information stored on media such as hard disk drives doubled between 2000 and 2003.
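To put the quoted doubling periods on a common footing, each one can be converted into an implied annual growth factor. This is a quick sketch that assumes steady exponential growth and treats "between 2000 and 2003" as a 36-month doubling period; the function name is mine.

```python
def annual_growth_factor(doubling_months: float) -> float:
    """Growth multiple over 12 months if volume doubles every doubling_months."""
    return 2.0 ** (12.0 / doubling_months)

# Doubling periods taken from the quotes above, in months.
for months in (3, 12, 18, 36):
    factor = annual_growth_factor(months)
    print(f"doubles every {months:>2} months -> x{factor:.2f} per year")
```

The spread is enormous: doubling every three months means sixteen times the data after a year, while doubling every three years works out to roughly 26% annual growth.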

So let's look at this a little more carefully - we have a scientific study that looks not at the creation of data, but rather at the use of storage media to hold what is out there. And out there is a lot of stuff needing a lot of storage, like images, music, videos, etc. These are things that contain information, yet from which it is still a challenge to extract data. Also, consider that for each thing out there, there are likely to be a lot of copies! I am sure that a scan of all the TiVos in the country would demonstrate that lots of people are still catching up on older episodes of 24 and American Idol.

I need to refine my question a little bit, then, but I am afraid it will be difficult to track down defensible sources for it. I am more interested in knowing about the growth rate for data that can be integrated into an actionable information environment. I may not care about the bits comprising that specific episode of 24 that is sitting on millions of DVRs, but as an advertiser, I might be interested in profiling which households have watched which episodes and at what kind of time shift.

Anyone have any ideas?

Posted January 23, 2008 10:48 AM