Originally published August 20, 2009
Note: Some of the techniques and approaches discussed herein are intellectual property protected by pending patents.
One of the most confusing and misunderstood aspects of integrating raw text into a form useful for textual analytics is the conversion of specific text into generic text. To understand why such a conversion is necessary, consider the issue of terminology. Text is used over time and by many different types of people. People with different backgrounds and geographies may talk about the same thing using different terms. Lawyers have their own vocabulary. Doctors have their own vocabulary – and specialists may have vocabularies that differ from those of general practitioners. Therefore, to surmount the challenge created by terminology, it is necessary to think of some words as specific words and other words as generic words.
A specific word is generally a word in a class. The class of words is the generic word. There are many examples of specific and generic words. Some simple examples: “Ford,” “Toyota” and “Porsche” are specific words in the generic class “car,” while “Phoenix” and “Wichita” are specific words in the generic class “American town.”
These classes of specific and generic data provide the key to getting through the barrier created by different terminologies. In order to create a specific and generic reference to text, one way to proceed is to write the generic reference in the same location as the specific reference. For example, the textual ETL tool reads “Ford” and writes “car” in the same place as the word “Ford.” In fact, in every place where “Ford,” “Toyota,” “Chevrolet” and “Porsche” are encountered, the word “car” is added.
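As a rough illustration, the substitution just described can be sketched in a few lines of Python. The lookup table and the slash notation are assumptions made for the sake of the example; an actual textual ETL tool would supply its own mapping and storage format.

```python
import re

# Hypothetical lookup table; a real textual ETL tool would supply this.
SPECIFIC_TO_GENERIC = {
    "Ford": "car",
    "Toyota": "car",
    "Chevrolet": "car",
    "Porsche": "car",
}

def annotate(text: str, mapping: dict) -> str:
    """Write the generic word in the same place as each specific word."""
    def tag(match):
        word = match.group(0)
        generic = mapping.get(word)
        return f"{word}/{generic}" if generic else word
    return re.sub(r"\w+", tag, text)

print(annotate("The dealer sold a Ford and a Porsche.", SPECIFIC_TO_GENERIC))
# The dealer sold a Ford/car and a Porsche/car.
```

Every occurrence of a specific word picks up its generic companion, so later processing can work with either form.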
In fact, some words may have more than one generalization. The word “Porsche” may have the categories of “car,” “sports car” and “luxury item.” All of these generic categories may apply to “Porsche”. In addition there may be multiple levels of categorization. For example, sitting above “car” may be “transportation.” So, generic categorizations may be hierarchical.
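The multiple, hierarchical generalizations can be modeled as a small graph walk that collects every category reachable from a specific word. The parent links below are illustrative assumptions drawn from the examples in the text, not a published taxonomy.

```python
# Hypothetical parent links: each word may generalize to several categories,
# and categories themselves may have parents (e.g., "car" -> "transportation").
PARENTS = {
    "Porsche": {"car", "sports car", "luxury item"},
    "Ford": {"car"},
    "car": {"transportation"},
}

def all_generics(word: str) -> set:
    """Collect every generic category reachable from a specific word."""
    found, frontier = set(), [word]
    while frontier:
        current = frontier.pop()
        for parent in PARENTS.get(current, ()):
            if parent not in found:
                found.add(parent)
                frontier.append(parent)
    return found

print(sorted(all_generics("Porsche")))
# ['car', 'luxury item', 'sports car', 'transportation']
```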
In addition, there may be more than one type of categorization, and one type of categorization may be favored over another. For example, the word “Ford” may appear in both the category “car” and the category “former president.” If the document being addressed is from Detroit and Motor Trend magazine, then the favored category would be “car.” But if the document is about discussions regarding the pardoning of President Nixon, then the favored category would be “former president.”
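One plausible way to pick the favored category is to score each candidate sense against context words found in the document. The cue lists here are invented for illustration; the article does not specify how the favoring is actually performed.

```python
# Hypothetical cue words for each sense of an ambiguous specific word.
SENSES = {
    "Ford": {
        "car": {"engine", "dealer", "motor"},
        "former president": {"pardon", "nixon", "white house"},
    }
}

def favored_category(word: str, document: str):
    """Return the sense whose cue words appear most often in the document."""
    doc = document.lower()
    senses = SENSES.get(word, {})
    # Score each sense by how many of its cue words occur in the document.
    scores = {sense: sum(cue in doc for cue in cues)
              for sense, cues in senses.items()}
    return max(scores, key=scores.get) if scores else None

print(favored_category("Ford", "Discussion of the pardon of President Nixon"))
# former president
```

A production tool would use richer context than substring counts, but the shape of the decision is the same: the document's subject matter selects the category.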
The different forms of categorization are typically created using taxonomies. A taxonomy is derived for a body of words. A body of words may contain many different taxonomies. Once derived, a taxonomy can be useful in many other places. For example, suppose a taxonomy is derived for Sarbanes Oxley. The taxonomy most likely will be useful in many places other than the one in which it is derived. Thus, taxonomies take on a life of their own.
While all of this may be useful for describing how terminology may be handled in integrating textual data, the real, practical value is probably not apparent at all. In order to explore the practical value of generic and specific treatment of text in the integration process, let’s consider an example.
Suppose there was information about activities happening in the United States. Suppose the text has words such as “Phoenix,” “El Paso,” “Wichita,” “Walla Walla” and “Roanoke.” There is nothing wrong with these words, but they represent a very low level of specificity. Suppose generic terms were applied to these words – “American town” added everywhere the words appeared. Now there would be “Phoenix/American town,” “El Paso/American town,” “Wichita/American town” and so forth.
When the query tool goes to make a query, a query can be made either at the specific level or the generic level. A query can be made for “Tucson,” and if any references to “Tucson” are found, the query is satisfied. But a query can also be made to “American town.” In this case, if there are any references to “Tucson,” they will be found, along with the references to other American towns. So data that is both specific and generic can be located.
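Once the generic terms are stored beside the specifics, a query can match either level. The slash-delimited tokens below (with the generic term hyphenated so each pair survives whitespace splitting) are an assumed storage format for the sake of the sketch.

```python
# Assumed corpus in which the integration step has already written the
# generic term next to each specific word.
DOCUMENTS = [
    "Flights to Tucson/American-town resumed",
    "Rain fell on Roanoke/American-town today",
    "The Porsche/car was recalled",
]

def query(term: str) -> list:
    """Match a term against either side of each token's specific/generic pair."""
    hits = []
    for doc in DOCUMENTS:
        for token in doc.split():
            if term in token.split("/"):
                hits.append(doc)
                break
    return hits

print(query("Tucson"))         # only the Tucson document
print(query("American-town"))  # every document about an American town
```

A query for “Tucson” finds just that town, while a query for the generic term sweeps in every town in the corpus, which is exactly the behavior the paragraph above describes.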
From the standpoint of data modeling, it is worth recognizing that this practice of abstracting data runs exactly opposite to what is learned conventionally. In conventional data modeling, a high-level abstraction is created first. Typically, this is an ERD. The high-level model is then “fleshed out” to a lower level of abstraction, where keys and attributes are added. Finally, the model is carried down to its lowest, most detailed level.
When operating on textual data, the process happens in reverse. The data modeler starts with the textual data. After the text is examined, the most basic abstractions are recognized. Then higher levels of abstraction are created.
The higher levels of abstraction show up as generic categories of data, or taxonomies. The taxonomies are applied against the raw text to create a structuring of data that serves to address the issues of terminology.