Definitions in Information Management
Originally published May 5, 2010
I have just finished my new book, Definitions in Information Management (now available via www.data-definition.com) and have been thinking about what I learned about definitions that has had the most impact on me. The main inspiration for this project was a long-term uneasy feeling I have had that definitions are a weak link in data management. I have always been surprised by the contrast between the vast amount that is written or said about tools in which definitions can be stored versus definitions themselves. Why are we much more concerned about the containers for definitions than we are about the content of definitions? Similarly, we hear a good deal about semantics, such as the Semantic Web initiative. Many people seem happy to discuss "semantic technologies," but when it comes to the fundamentals of semantics - to understanding meaning - there is really very little guidance.
The Mirage of Dictionary Definitions
Having found what to me at least were gaps in the way we think about definitions in data management, I began to look at how definitions are generally approached. The obvious place to look was at how dictionaries are built. Since my earliest days in school I have been using dictionaries, but while researching the book I gradually became aware that dictionaries may be misleading.
First of all, dictionaries nearly always have very abbreviated definitions, usually no more than a single sentence and often just a phrase. This had always influenced me to copy the dictionary style, for instance, when doing data modeling. I never really stopped to think about it. Yet, on reflection, it is obvious why dictionaries are written like this - it is to save printing costs. If full definitions had to be put in for each word, then a dictionary would be huge and hugely expensive (as some admittedly are). The reason that most dictionaries have the abbreviated style of definitions is purely an economic one. And it is one that works against creating good definitions.
There is another problem created by dictionaries and the way we use them in schools. Learning definitions by rote, particularly for tests like the SAT, supports the impression that definitions are always fully known in advance of any work involving what is being defined. We are often told that we must begin discussions of anything by fully defining the terms or concepts involved. But how can this be done? After all, we do scientific research to find out about things we do not yet understand fully and which we therefore cannot define. Surely that is also true for business analysts and data analysts in their work. The reality is that good definitions tend to come at the end of things rather than at the beginning. We seem to have been misled by the influence of dictionaries and the way we are taught to use them.
Process and Product
As I looked more closely at the issue of the impossibility of creating "up front" definitions, I was surprised to find that definition should be considered as much process as product. During all the time I have been data modeling, I have found that the emphasis was on which entities and attributes had definitions and which had not. It was as if there was a simple binary state where it was assumed that an analyst knew the definition or had not acquired it yet. But if we understand that definition is a process, then we can appreciate that at first we have a cloudy idea of what a concept is, but we gradually refine the definition. It is similar to gradually bringing an image into focus. The definition is subject to gradual improvement.
I had rarely seen this in data modeling. Definitions were usually either there or not there, and if they were there, they were either right or wrong. The idea of continuous improvement is an exception. However, it makes a lot of sense, and where I have seen it applied it has been impressive. What is interesting is that if we accept the idea of continuous improvement, then definition really is a process. And if it is a process, then governance must play an important role in it.
What kind of governance is needed for the process of definitions is something that will need to be refined, although I give some examples in the book. Almost certainly it will be collaborative and involve as many individuals across the enterprise as possible. However, there will probably have to be experts (trustees) for each definition who will be responsible for the quality of the definitions in their care.
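To make the idea of definition as a governed process a little more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the class names (ManagedDefinition, Revision), the propose/approve workflow, and the rule that only the trustee may approve a revision are all hypothetical and are not taken from the book. The point is simply that a definition can be stored as a history of revisions rather than as a single value that is either present or absent.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Revision:
    """One version of a definition's text (hypothetical structure)."""
    text: str
    author: str
    revised_on: date
    approved: bool = False


@dataclass
class ManagedDefinition:
    """A definition treated as a process: a term, a trustee, and a revision history."""
    term: str
    trustee: str  # assumed: one person accountable for quality
    revisions: List[Revision] = field(default_factory=list)

    def propose(self, text: str, author: str, revised_on: date) -> Revision:
        # Anyone may propose a refinement; it is recorded but not yet approved.
        revision = Revision(text=text, author=author, revised_on=revised_on)
        self.revisions.append(revision)
        return revision

    def approve(self, revision: Revision, approver: str) -> None:
        # Assumed governance rule: only the trustee can approve a revision.
        if approver != self.trustee:
            raise PermissionError(f"{approver} is not the trustee for {self.term!r}")
        revision.approved = True

    def current(self) -> Optional[Revision]:
        # The working definition is the most recently proposed approved revision.
        for revision in reversed(self.revisions):
            if revision.approved:
                return revision
        return None


customer = ManagedDefinition(term="Customer", trustee="alice")
first = customer.propose("A party that buys from us.", "bob", date(2010, 1, 4))
customer.approve(first, approver="alice")
customer.propose("A party with at least one completed order.", "carol", date(2010, 3, 1))
print(customer.current().text)
```

In this sketch the second, sharper wording stays in the history as a proposal until the trustee approves it, so the definition can improve gradually without losing track of what is currently agreed.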
The Role of Definitions
One of the other puzzles I encountered while working on the book is that many people in data management agree that definitions are important, but seem to get little value from them. Perhaps this has something to do with our daily experiences. We rarely explicitly worry about the definitions of things in our daily lives. Perhaps we should, but with data there is something special. Data has significance. It has meaning, and if we are to use it successfully it is essential to understand this meaning. The material things of our daily lives are not quite like this. Automobiles, furniture, foods, beverages, clothes and so on can be defined, but our interactions with them are not dependent on definitions in the same way that using data is.
Might this attitude explain why there are so many analysts doing source data analysis, and why the same questions get asked about the same data over and over again? Do we only pay lip service to data but really treat it like a material object? If the analysts performing source data analysis really did record all the facts they discovered about the meaning of the data they examine, then it should make things easier for future analysts. If the individuals responsible for originally creating the databases examined by these analysts had done a good job of definitions, then the need for source data analysis would be greatly reduced.
Perhaps this paradox can be explained by the relatively recent growth in the importance of data-centric applications and the reuse of existing data. Prior to that, applications tended to focus on the automation of previously manual processes or on moving existing applications to newer technologies. If data has to be repurposed, then its meaning suddenly becomes vastly more important, and good definitions are vital.
The other thing that struck me while I was writing the book is that in information management generally we are at a low level of maturity in thinking about definitions. Something like a "definition" is not necessarily just a single blob of unstructured text. We should be able to find structure within definitions and explode it out into a whole set of structured metadata. Perhaps this cannot be done, but it is worth a try because if it could be done, then the structured elements of definitions could be compared, and the promise of "understanding" data in initiatives such as the Semantic Web could truly be supported. But even if this is not attainable, good textual definitions could be extremely helpful to human analysts in data integration, where semantic reconciliation among sources and between the sources and the target is always important.
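As a rough illustration of what "exploding" a definition into structured metadata might look like, here is a small Python sketch. The particular structure is my own assumption rather than anything established: it borrows the classical split of a definition into a genus (the broader class) and differentiae (the distinguishing characteristics), and the names StructuredDefinition and compare_definitions are hypothetical. Whether real definitions can be decomposed this cleanly is exactly the open question.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class StructuredDefinition:
    """A definition broken into comparable parts (assumed structure)."""
    term: str
    genus: str                    # the broader class the term belongs to
    differentiae: FrozenSet[str]  # characteristics that set it apart within the genus
    source: str                   # the system or document the definition comes from


def compare_definitions(a: StructuredDefinition, b: StructuredDefinition) -> Dict[str, object]:
    """Report where two definitions agree and where they diverge."""
    return {
        "same_genus": a.genus == b.genus,
        "shared": sorted(a.differentiae & b.differentiae),
        "only_in_first": sorted(a.differentiae - b.differentiae),
        "only_in_second": sorted(b.differentiae - a.differentiae),
    }


# Two hypothetical source systems that both use the term "Customer".
crm = StructuredDefinition(
    term="Customer",
    genus="party",
    differentiae=frozenset({"has a billing account", "has placed an order"}),
    source="CRM",
)
billing = StructuredDefinition(
    term="Customer",
    genus="party",
    differentiae=frozenset({"has a billing account", "has been invoiced"}),
    source="Billing",
)

for key, value in compare_definitions(crm, billing).items():
    print(key, value)
```

Even in this toy form, the comparison surfaces the kind of semantic difference between sources that an analyst doing data integration would otherwise have to rediscover by hand.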
The overall conclusion that I have come to is that although there is a superficial appreciation of the importance of definitions, too few practical approaches have yet been created to use them effectively in data management. However, I am hopeful that we can rise from this relatively low level of maturity in the coming years.
Recent articles by Malcolm Chisholm
Copyright 2004 — 2019. Powell Media, LLC. All rights reserved.
BeyeNETWORK™ is a trademark of Powell Media, LLC