What is Unstructured Data?

Originally published February 19, 2009

The world of commercial information technology has been created as a result of structured data. Structured data is data that occurs in a predictable manner, usually as a result of a transaction. When it comes to structured data, one thinks of banking activities, airline reservations, ATM activities, shop floor control systems and the like. In these environments, the same activity is executed repeatedly. The only difference from one transaction to the next is the parametric data that has been entered into the transaction.

All of this structured data and processing is the basis for 99% of the processing that occurs in the world of information systems today.

But the world of information processing is on the cusp of an entirely different kind of data and processing – unstructured data and processing. So what does the world look like when one takes a peek over the edge, peering into the new world of unstructured information?

The first question that is usually asked is – exactly what is unstructured data? A simple answer is that unstructured data is everything that structured data is not. The following are examples of unstructured data:

  • X-rays of the human skeleton

  • Telephone conversations

  • Pictures with colors and shapes

  • Videotapes and movies

  • Contracts

  • Text messaging

  • Pictures

For the most part, the technology underlying everything but electronic textual data is so undeveloped that analytical processing against it is impossible, or at least highly impractical. As a result, nearly all of the real analytical processing of unstructured data today is done on electronic text.

There are, however, some interesting examples of unstructured data processing that fall outside of electronic text. One of the most interesting nontextual unstructured technologies is facial recognition.

To understand the value of facial recognition software, I recommend you read the book Bringing Down the House (Mezrich). This book is the true story of the students at MIT who took Las Vegas for millions of dollars playing team blackjack. (Read the book to see how team blackjack works; it is truly a fascinating story.)

Toward the end of their caper, the students were turned in by one of their own to the gambling casinos in Las Vegas. The casinos were given pictures of the students. Upon arriving at a casino, the students were recognized and escorted out. But having a lot of money, the students hired professional makeup artists to disguise them. Even with the best disguises that money could buy, the students were still recognized by the facial recognition technology. So there is at least one example of viable nontextual unstructured technology.

Another form of nontextual unstructured technology is voice recognition, in which speech is transformed into electronic text. Voice recognition technology has existed for a long time. A common complaint is that it does not recognize words perfectly: even in the best of circumstances, only approximately 90% of the words are recognized.

The less than perfect voice recognition technology can be greatly enhanced by “training” the software to understand accents and colloquialisms. But even in the best of circumstances, there is a percentage of the words that are not recognized. The interesting thing is that when humans listen to another person, we do not understand 100% of the words said. Our brains are good at “filling in the blanks.” So if our brains do not hear at a 100% rate, why should we be concerned with software technology that does not hear at a 100% rate?

There have been attempts to look at and recognize colors, sizes and shapes. Every now and then, there is an announcement of a breakthrough. But for the most part, unstructured technology that deals with this aspect of unstructured information is a long way away from a commercial application.

Even when it comes to electronic textual recognition, perfection is not achieved. Some of the less than perfect aspects of electronic textual processing are:

  • The need to look at different languages

  • The need to manage synonyms and homographs

  • The need to understand textual structural mapping created by an author
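
As a rough illustration of what managing synonyms can involve, the sketch below maps several variant words onto one canonical term before any analysis is done. The synonym table and the function name are hypothetical and are not taken from any particular product. Homographs are deliberately left out, because telling them apart requires context that a simple lookup does not have.

```python
import re

# Hypothetical synonym table: each variant maps to one canonical term.
SYNONYMS = {
    "car": "automobile",
    "auto": "automobile",
    "vehicle": "automobile",
}


def normalize_terms(text, synonyms=SYNONYMS):
    """Lowercase the text, split it into words, and replace known synonyms."""
    words = re.findall(r"[a-z']+", text.lower())
    # Words without an entry in the table pass through unchanged.
    return [synonyms.get(word, word) for word in words]
```

A real synonym list would have to be built for the business in question and, per the first point above, for each language being processed.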

One of the biggest problems facing the analyst who wishes to address the usage of text for analytical processes is that of filtering text. When one looks at text, it is normal to find all sorts of text. Some text is important and useful to the business. Other text is irrelevant to the business. It is necessary to “weed out” the irrelevant text from the important and salient text.
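
One very simple way to picture this weeding-out step is a keyword filter that keeps only the documents mentioning terms the business cares about. The sketch below assumes a hypothetical list of relevant terms and exact word matching. It is meant only to show the idea, not to describe how any commercial text filter works.

```python
import re

# Hypothetical list of terms that matter to the business.
RELEVANT_TERMS = {"contract", "invoice", "claim", "policy"}


def is_relevant(document, relevant_terms=RELEVANT_TERMS, min_hits=1):
    """Return True if the document contains at least min_hits relevant terms."""
    # Exact word matching only: "contracts" would not match "contract".
    words = set(re.findall(r"[a-z]+", document.lower()))
    return len(words & relevant_terms) >= min_hits


def filter_documents(documents, relevant_terms=RELEVANT_TERMS):
    """Keep the documents judged relevant and drop the rest."""
    return [doc for doc in documents if is_relevant(doc, relevant_terms)]
```

For example, given the two texts "The contract was signed on Monday." and "Lunch at noon?", this filter keeps the first and drops the second.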

Another issue is the sheer volume of text. It is estimated that the standard corporation holds at least 4 to 5 times as much text as structured information. Given that size, unnecessary and irrelevant text must be filtered out before the text can be meaningfully included in corporate analytical systems.

In addition, there is the simple process of transcription that has its own difficulties. One would think that reading text would be an easy process. And in most circumstances, that is the case. But for a variety of reasons, it turns out that merely reading text can be a challenge.

These then are some of the interesting things that one sees when looking over the edge of the cliff to the next horizon of information processing.

  • Bill Inmon

    Bill is universally recognized as the father of the data warehouse. He has more than 36 years of database technology management experience and data warehouse design expertise. He has published more than 40 books and 1,000 articles on data warehousing and data management, and his books have been translated into nine languages. He is known globally for his data warehouse development seminars and has been a keynote speaker for many major computing associations.

