Posted November 29, 2010 12:56 PM
Just published! My new book on data quality improvement, The Practitioner's Guide to Data Quality Improvement, was released a few weeks ago and is now available. The book provides practical information about the business impacts of poor data quality and offers pragmatic suggestions on building your data quality roadmap, assessing data quality, and adapting data quality tools and technology to improve profitability, reduce organizational risk, increase productivity, and enhance overall trust in enterprise data.
I have an accompanying web site for the book at www.dataqualitybook.com. At that site I am posting my ongoing thoughts about data quality (and other topics!) and you can download a free sample chapter on data quality maturity!
Please visit the site, check out the chapter, and let me know your thoughts by email: loshin@knowledge-integrity.com.
They say that data integration accounts for 80% of the effort of a data warehousing project (or of a variety of other enterprise applications). But who are "they"? I know that the figure is often presented as the typical resource and time investment for data integration activities, but have not tracked down a source for it. I seem to recall seeing it in some data warehousing book, but do not remember which one.
Nonetheless, there is no reason for data integration to consume that amount of effort if the right steps are taken ahead of time to reduce the confusion and complexity of ambiguous semantics and structure. I will discuss these issues at a webinar this Thursday, August 12 - hope you can make it!
So far I have seen a number of environments that have paid lip service to metadata as the be-all and end-all to solving all enterprise data issues and solidifying all enterprise data management needs. The reality seems to be that there is a lot of value for metadata in a number of instances although the value proposition for the investment in a full-scale implementation still seems to be lacking somewhat.
Some basic implementations cover data entity definitions, structures, and corresponding data element definitions and structure as well. Yet often the metadata repository is largely uni-directional, acting as a sink for data definitions etc., but having no "active" componentry that feeds back to the consuming applications.
The upshot is there is a need for a continuous investment in maintenance. However, those situations showing the criticality of metadata are those where the systems are changing - modernizations, migrations to ERP, MDM implementations. In essence, these are the places where the current system is being trashed and the data needs to move to a new system.
This is a true conundrum - there is a need to maintain the metadata (and a corresponding investment) while the systems are in use in preparation for their retirement. While the systems are in production, the metadata is not in great demand (since things are typically not going to change too much). This lowers the perceived priority of metadata management.
You do need it when you are changing things. But at that point you are throwing out not just the existing system, but also its reliance on the existing documented metadata. The return is therefore limited, because you have invested a huge effort in maintaining something you are about to retire. But I do need metadata when I am going to migrate data so I know what I have to work with.
And yet, metadata management is an indicator of good data management practices, and is likely to coincide with good system development and maintenance practices, lowering the need for system modernization.
So metadata is needed usually when I don't have it and is not needed when I do have it.
On top of that, the effort to maintain discrete information about the thousands (if not tens of thousands) of data elements used across an organization is gargantuan, which also limits the utility of a metadata resource (since it will take forever to collect all the information).
The answer has got to be somewhere in between - "just enough metadata" to support existing application needs (for improvements and upgrades to functionality) and enough to support the processes needed to retire the applications and design their replacements.
Anyone have any experiences that can support this view? Post them!
I have been tinkering with some of the blogging tools out there (so far I like WordPress a lot). One nice aspect of the blogging framework is the expectation of meta-tagging of your content that helps in organization and presentation, which is quite nice because the system does some of the work that I have always been loath to do (that is, "organizing things").
One way to do this is by categorizing your entries as well as adding additional tags. I was pondering this at some point, thinking that it should be possible by now to use text mining tools to scan your content and pull out the "statistically improbable" phrases (as our friends at Amazon like to say) to be used as tags.
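To make the tagging idea a bit more concrete, here is a minimal sketch in Python. It is not Amazon's method (I do not know the details of that); it simply assumes that a phrase is "improbable" if it shows up in a post much more often than it does in some background body of text, and scores two-word phrases with a smoothed log ratio. The function names, the background text, and the cutoff of two occurrences are all my own assumptions for illustration.

```python
import math
import re
from collections import Counter


def ngrams(text, n=2):
    """Lowercase the text, split it into words, and return all n-word phrases."""
    words = re.findall(r"[a-z']+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def improbable_phrases(post, background, n=2, top_k=3):
    """Rank phrases in `post` by how much more frequent they are than in `background`.

    Hypothetical helper: a simple log-ratio score, not any vendor's algorithm.
    """
    post_counts = Counter(ngrams(post, n))
    bg_counts = Counter(ngrams(background, n))
    post_total = sum(post_counts.values())
    bg_total = sum(bg_counts.values())
    vocab = len(set(post_counts) | set(bg_counts))

    scores = {}
    for phrase, count in post_counts.items():
        if count < 2:
            # Assumed cutoff: ignore phrases that appear only once in the post.
            continue
        p_post = count / post_total
        # Add-one smoothing so phrases unseen in the background do not divide by zero.
        p_bg = (bg_counts[phrase] + 1) / (bg_total + vocab)
        scores[phrase] = math.log(p_post / p_bg)

    # Highest score first; ties broken alphabetically for a stable result.
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    return ranked[:top_k]


if __name__ == "__main__":
    post = ("Metadata repository design matters. A metadata repository "
            "needs upkeep. The team met and the team left.")
    background = ("The team met today and the team met again "
                  "and the team left early.")
    print(improbable_phrases(post, background))
```

In practice the background would be something much larger, such as all of your earlier posts, and you would probably want to filter out phrases made up only of common words before offering them as tags.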
But what about non-text content? I can think of three commonly used content types that are growing in popularity yet require some extra thought for assigning meta-tags: pictures, voice recordings, and video recordings. As more of this unstructured stuff comes down the pike, we metadata folks should think hard about how to assess and capture semantics associated with these objects for the purposes of organization.
A few years back my friend Greg Elin put together a system for selectively annotating pictures. Check out his fotonotes web site. Perhaps there is some future in this for video?