In previous articles in this series (see Part 1, Part 2 and Part 3), we determined that corporate data can be divided into three classes – structured data, unstructured repetitive data, and unstructured nonrepetitive data. We discussed deriving business value from structured and unstructured repetitive data. In this article we will address the issue of deriving business value from unstructured nonrepetitive data.
Unstructured nonrepetitive data takes many forms: email, call center data, warranty claims, health care data, and so forth. Unlike structured data, it does not carry with it a rich infrastructure of information about the data. That makes analytical processing much more difficult than it is in the structured world. There is no convenient definition of records, keys and attributes in the unstructured nonrepetitive environment, and the context of the data is neither apparent nor obvious.
But even though it is not obvious, unstructured nonrepetitive data does carry with it both context and an infrastructure of data. The context and infrastructure are simply embedded inside the data itself. To bring them to light, the data must be passed through a process known as “textual disambiguation” (or “textual ETL”). Textual disambiguation reads the raw data and derives the context and other descriptive information embedded in it. That information is then transformed into a standard infrastructure that fits comfortably with standard analytical processing. Once this is done, the business analyst is able to do analytical processing against the unstructured nonrepetitive data.
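As a rough illustration of what a "standard infrastructure" might look like, the sketch below reads a piece of raw text and emits one row for each word it recognizes. The row layout (`doc_id`, `offset`, `term`, `context`), the tiny hand-made `TAXONOMY` dictionary, and the function name `disambiguate` are all assumptions made for this example. They are not the format of any particular textual ETL product.

```python
import re
from typing import List, NamedTuple


class ContextRow(NamedTuple):
    doc_id: str   # which document the term came from
    offset: int   # character position of the term in the raw text
    term: str     # the word as it appeared in the text
    context: str  # the context assigned to the word


# Hypothetical taxonomy for illustration only: term -> context.
TAXONOMY = {
    "honda": "car",
    "porsche": "car",
    "brake": "car part",
    "refund": "remedy",
}


def disambiguate(doc_id: str, text: str) -> List[ContextRow]:
    """Return one row for every word that the taxonomy recognizes."""
    rows = []
    for match in re.finditer(r"[A-Za-z]+", text):
        context = TAXONOMY.get(match.group().lower())
        if context is not None:
            rows.append(ContextRow(doc_id, match.start(), match.group(), context))
    return rows


rows = disambiguate("claim-001", "The brake on my Honda failed; I want a refund.")
for row in rows:
    print(row)
```

Each row now has a key (`doc_id`, `offset`) and attributes (`term`, `context`), which is the kind of shape that ordinary analytical tools expect.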
Reading unstructured nonrepetitive data involves looking at the data from many different perspectives in order to determine context and other important information about the data. Some of the ways that textual disambiguation must view the data are:
Through the filter of one or more relevant external taxonomies or ontologies: Words and phrases are interpreted through the lens of one or more externally applied taxonomies.
Through pattern recognition of words: On occasion it is possible to derive meaning through the recognition of patterns of words.
Through pattern recognition of the structure of single words: On occasion the very structure of a single word determines its context and meaning.
Through the proximity of words to each other: On occasion the proximity of words shapes their meaning and context.
Through the stemming of words: On occasion the interpretation of words goes back to their basic word stems.
Through meta-taxonomies: On occasion, words can be grouped together and context and relationships can be found by the creation of a meta-taxonomy.
There are also many other techniques that allow text to be read and understood.
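Three of the techniques listed above (the structure of a single word, stemming, and proximity) can be sketched in a few lines. This is a minimal sketch under stated assumptions: the patterns in `WORD_SHAPES`, the suffix list in `crude_stem`, and the default `max_gap` of three tokens are all invented for illustration, and `crude_stem` is a deliberately naive stand-in for a real stemming algorithm.

```python
import re

# Hypothetical patterns: the shape of a single token suggests its context.
WORD_SHAPES = {
    "phone number": re.compile(r"^\d{3}-\d{3}-\d{4}$"),
    "date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}


def shape_context(token):
    """Return the context implied by the token's structure, or None."""
    for context, pattern in WORD_SHAPES.items():
        if pattern.match(token):
            return context
    return None


def crude_stem(word):
    """Strip a few common English suffixes (naive; not a real stemmer)."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def within_proximity(tokens, first, second, max_gap=3):
    """True if the stems `first` and `second` occur within max_gap tokens."""
    stems = [crude_stem(t) for t in tokens]
    first_pos = [i for i, s in enumerate(stems) if s == first]
    second_pos = [i for i, s in enumerate(stems) if s == second]
    return any(abs(i - j) <= max_gap for i in first_pos for j in second_pos)


text = "Customer called 555-867-5309 on 2014-03-02 reporting a leaking radiator"
tokens = text.split()

for token in tokens:
    context = shape_context(token)
    if context is not None:
        print(token, "->", context)

print(within_proximity(tokens, "leak", "radiator"))
```

In this sketch "leaking", "leaked" and "leaks" all reduce to the stem "leak", so a proximity rule written once against the stem covers every variant.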
After the unstructured nonrepetitive data has been passed through textual disambiguation, the result is data in a form that can be analyzed by the business analyst. Unstructured nonrepetitive data must be transformed in this way before it has a format and structure that is useful for analysis.
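To suggest what "a form that can be analyzed" could mean in practice, the sketch below loads a handful of disambiguated rows into an in-memory SQLite table and runs an ordinary SQL query against them. The table name `term_context`, its columns, and the sample rows are all made up for this example. Real output from textual disambiguation would be far larger and richer.

```python
import sqlite3

# Made-up sample rows standing in for the output of textual disambiguation.
sample_rows = [
    ("claim-001", "brake", "car part"),
    ("claim-001", "Honda", "car"),
    ("claim-002", "radiator", "car part"),
    ("claim-003", "brake", "car part"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE term_context (doc_id TEXT, term TEXT, context TEXT)")
conn.executemany("INSERT INTO term_context VALUES (?, ?, ?)", sample_rows)

# How many distinct documents mention each term in the 'car part' context?
query = """
    SELECT lower(term) AS term, COUNT(DISTINCT doc_id) AS documents
    FROM term_context
    WHERE context = 'car part'
    GROUP BY lower(term)
    ORDER BY documents DESC, term
"""
results = conn.execute(query).fetchall()
print(results)
```

Once the text has been reduced to rows like these, the analyst can work with familiar tools such as SQL rather than with the raw text itself.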
SOURCE: Deriving Business Value from Unstructured Nonrepetitive Data
Recent articles by Bill Inmon