It is often stated that in most corporations 80% of the data is textual. In some corporations (insurance companies, for one) that estimate may even be low. And for a long time it has been held that textual data cannot be manipulated by a computer. Textual data is notoriously non-repetitive, and non-repetitive data simply does not fit well into a standard database management system. Database management systems are built for data with a structure that repeats itself over and over.
The result of not committing text to a computer is that analyzing the text is beyond cumbersome. Analyzing text manually is an impossibility, not merely an improbability. At some point there is so much text that looking at it and analyzing it by hand cannot be done, even by the largest and wealthiest corporations on earth.
But now there is textual ETL (extract, transform, load). With textual ETL, it is possible to read text and place the text in a standard relational database management system. Now text can be treated like any other data (well, almost any other data, as we shall see).
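To make the idea of reading text and placing it in a relational database concrete, here is a minimal sketch in Python using SQLite. It is not the textual ETL technology itself, only an illustration under simple assumptions: the table name `word_occurrence`, the functions `load_document` and `word_count`, and the sample claim text are all hypothetical, and the tokenizer is deliberately crude.

```python
import re
import sqlite3


def load_document(conn, doc_id, text):
    """Extract words from raw text and load one row per word occurrence."""
    # Hypothetical table layout: which document, where in it, and which word.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS word_occurrence ("
        "doc_id TEXT, position INTEGER, word TEXT)"
    )
    # Transform: lowercase the text and keep runs of letters, digits, apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Load: one relational row per word, in reading order.
    conn.executemany(
        "INSERT INTO word_occurrence (doc_id, position, word) VALUES (?, ?, ?)",
        [(doc_id, position, word) for position, word in enumerate(words)],
    )
    conn.commit()
    return len(words)


def word_count(conn, word):
    """Count occurrences of a word with ordinary SQL."""
    row = conn.execute(
        "SELECT COUNT(*) FROM word_occurrence WHERE word = ?", (word.lower(),)
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
load_document(conn, "claim-001", "Water damage reported. Water entered the basement.")
print(word_count(conn, "water"))
```

Once the words sit in rows, ordinary SQL can count, join, and filter them like any other data. What a sketch like this does not capture is context, which is where the real difficulty lies.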
The challenge of treating text like any other data is that text obeys its own set of rules; text marches to the beat of its own drummer. Trying to apply the time-tested principles of data modeling, functional decomposition, and systems analysis and design simply does not work. It is like saying that the way to learn to fly an airplane is to learn to ride a bicycle. Riding a bicycle and flying an airplane may have a few similarities, but, in fact, very few.
In order to deal with text, it is necessary to delve into library science. Library science has been around for quite a while, even predating the computer. Library science is full of practices and technologies such as the Dewey Decimal System, taxonomies and ontologies, parsing, and other arcane disciplines. The problems faced by the librarian are very different from the problems faced by the online transaction processing (OLTP) systems programmer or the data warehouse database designer.
Thus a whole new body of lore needs to be explored in order to start making sense of text.
However, there are pitfalls along the way. Consider NLP, natural language processing. NLP certainly has its place, and it has had its victories (see IBM's Watson on Jeopardy!, hosted by Alex Trebek). But NLP is not a general-purpose solution. There are many pitfalls to NLP, the primary one being that the context of text is often, indeed normally, determined in a non-textual manner.
Let’s consider a situation that explains the importance of context. Two guys are standing on a street corner in Houston, Texas. A woman passes by, and one guy says to the other – “She is hot.” Now what is meant here? Is the lady young and attractive? Or, is it hot and humid in Houston and the lady is sweating profusely? Or, is the lady angry because she just received a parking ticket? The words “she is hot” can mean lots of things, and the context of the words depends on many non-verbal things. The best parsing algorithm in the world cannot tell what the words mean.
So NLP has its value, but the value is inherently limited.
Instead, a whole other set of techniques and disciplines is needed in order to transform data from text into a database. There are many facets to these techniques and approaches. In order to understand them, take just one simple issue that arises: text takes many forms. Certainly there is standard, well-formed text. This is the text that your English teacher taught you in school. There is proper spelling and proper punctuation. There are verbs and nouns and adverbs and prepositions. Certainly this form of text must be handled. But there are MANY other forms of text. Consider IM (instant messaging), where the number of bytes is limited, so there are standard shorthand symbols such as LOL, 4 and 2. Or there are doctors’ notes, and doctors have their own form of shorthand. Or consider the messages passed through a log tape.
The list goes on and on. There are many different forms of text and ALL of them must be accounted for if you are serious about turning text into a database. NLP cannot handle non-textual context, much less poorly formed text. So there has to be a better way to approach the problem of making text usable.
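One small piece of handling the many forms of text can be sketched as a per-source shorthand lookup applied before the text is loaded. This is only an illustration, not a description of any particular product: the `SHORTHAND` tables, the source labels `"im"` and `"clinical"`, and the `normalize` function are hypothetical, and the entries are a handful of examples rather than a real vocabulary.

```python
import re

# Hypothetical shorthand tables, one per form of text.
SHORTHAND = {
    "im": {"lol": "laughing out loud", "4": "for", "2": "to"},
    "clinical": {"pt": "patient", "hx": "history", "bp": "blood pressure"},
}


def normalize(text, source):
    """Expand shorthand tokens using the table for the given source of text."""
    table = SHORTHAND.get(source, {})

    def expand(match):
        token = match.group(0)
        # Unknown tokens pass through unchanged.
        return table.get(token.lower(), token)

    # Replace each alphanumeric run; punctuation and spacing are left alone.
    return re.sub(r"[A-Za-z0-9]+", expand, text)


print(normalize("LOL, good 2 know", "im"))
print(normalize("pt has hx of high bp", "clinical"))
```

Keeping one table per source matters because the same token means different things in different forms of text. Even so, the lookup is blind: in an instant message, "2" is sometimes just the number two, and only context can settle which reading is right.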
Indeed, the day has arrived when text can be placed into a standard relational database management system. And in doing so, analytical processing of a kind the world has never even imagined is now a reality.
SOURCE: Textual Data – A Brief Sojourn, by Bill Inmon