
Text is Text is Text: Not

Originally published August 26, 2010

As the world turns to the realization that text is as rich a source of information as classical transaction data, the focus starts to shift to the awareness of text all around us. At first glance, it is easy to say that all text is equal. In other words – text is text is text. But on closer examination, it becomes apparent that text is not text is not text. Not by a long shot.

In fact, the forms of text are as diverse as the people of the earth. Some people are short and some people are tall. Some people are heavy and some are light. Some people are old and others are young. To place all people in the same category shows an ignorance of reality.

Let’s consider some of the types of text:

  1. Email. Email tends to be short, a few sentences. While much of email can be considered personal, other email is relevant to business. The text that is used in an email can be either casual or formal. Email can contain rough and crude language. There are no rules about what can and cannot be included in email.

  2. Text messaging. Text messages tend to be very short, and there tend to be a lot of them. Text messaging relies heavily on abbreviations (R U there). Text messaging does not follow proper rules of grammar. Text messaging is almost exclusively personal. Much of text messaging is of a temporary nature. In many cases, text messaging depends upon a common context that is understood by both sender and receiver.

  3. Formal language. Technically speaking, formal language follows all the rules of spelling, grammar and punctuation. Formal language is found everywhere – in newspapers, in books, in technical manuals, and in magazines. Spelling, punctuation and grammar are all part of formal text, but nothing is ever as simple as that. Look beneath the covers and you see that formal text is filled with dialects. Doctors have their formal text. Lawyers have their formal text. Engineers have their formal text. Technicians have their formal text. Indeed, formal text can be broken into many dialects and sub-dialects.

    Even within different regions of the country, there are other dialects of formal text. Compare Chicago to Memphis to Santa Fe. Writers also have their own dialects. Compare John Steinbeck with William Faulkner with Danielle Steel, and you will see yet another form of dialect. Formal language is a motley collection of dialects, all following the same general set of rules.

  4. Comments. Take a look at the notes doctors take and see what rules of grammar or spelling are followed. Doctors have their own rules about such things. And talk about stylized shorthand! Doctors have their own secret language. But when it comes to comments, it is not just doctors who have their own system. Most people who write comments have their own lingo and symbols.

There are many other forms of text, each with its own idiosyncrasies. There are song lyrics. There is poetry. There are different languages. There is Old English. Indeed, when you stop and think about it, text is not text is not text.
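
To make the contrast between these forms concrete, here is a minimal Python sketch. It is an illustration only, not a method described in this article. It assumes a tiny, hand-picked vocabulary standing in for a formal-language dictionary, and it measures what share of the words in a piece of text that dictionary recognizes. The names FORMAL_VOCABULARY, tokenize, and known_word_share are hypothetical, and the doctor's shorthand sample is invented for the example.

```python
import re

# Hypothetical, hand-picked vocabulary standing in for a formal-language dictionary.
FORMAL_VOCABULARY = {
    "are", "you", "there", "the", "patient", "reports", "pain", "in", "left", "knee",
}


def tokenize(text):
    """Lowercase the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def known_word_share(text, vocabulary=FORMAL_VOCABULARY):
    """Return the fraction of tokens found in the vocabulary (0.0 for empty text)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in vocabulary)
    return known / len(tokens)


# Invented samples: one formal sentence, one text message, one comment in shorthand.
samples = {
    "formal": "Are you there?",
    "text message": "R U there",
    "comment": "Pt c/o pain L knee",
}

for label, text in samples.items():
    print(f"{label}: {known_word_share(text):.2f}")
# formal: 1.00
# text message: 0.33
# comment: 0.33
```

Under these assumptions, the formal sentence is fully recognized, while the text message and the shorthand comment are mostly opaque to the same dictionary, even though a human reader understands all three.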

It is somewhat (actually a lot!) conceited to think that an approach to understanding text such as NLP (natural language processing) can capture the meaning and fabric of text, given the many forms of text that exist. Perhaps NLP is good for formal language where there are no dialects, but that represents only a fraction of the actual text in the world.

Another problem with the effectiveness of NLP is that NLP needs context to capture meaning. The problem with context is that most context is nonverbal. Context depends on many things such as the time of day, the location, the temperature, the work being done, the activities of the moment, and even the station of the speaker and the listener. Words said by a father to a daughter are understood differently from those same words said by a police officer to a convict.

To think that natural language processing is adequate to capture and understand text is an assumption that is simply naïve and incorrect.

  • Bill Inmon

    Bill is universally recognized as the father of the data warehouse. He has more than 36 years of database technology management experience and data warehouse design expertise. He has published more than 40 books and 1,000 articles on data warehousing and data management, and his books have been translated into nine languages. He is known globally for his data warehouse development seminars and has been a keynote speaker for many major computing associations.

    Editor's Note: More articles, resources and events are available in Bill's BeyeNETWORK Expert Channel. Be sure to visit today!
