Is Text Really Unstructured Data?

Originally published March 6, 2014

Structured data is repetitive data that occurs over and over. Banking transactions, reservations, retail sales (SKU data) and telephone calls are all classical examples of what is known as structured data. Structured data fits neatly inside a standard database management system (DBMS).
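
As a minimal sketch of what "fits neatly" means, consider a handful of retail sales records. The table name, column names and sample values here are hypothetical, chosen only for illustration. Because every record carries the same fields, the records map directly onto rows and columns and can be queried at once:

```python
import sqlite3

# Hypothetical retail sales records: every record has the same three fields
# (SKU, quantity, amount). The values are made up for illustration.
SALES = [
    ("SKU-1001", 2, 19.98),
    ("SKU-2040", 1, 5.49),
    ("SKU-1001", 1, 9.99),
]

def total_by_sku(rows):
    """Load the rows into an in-memory table and total the sales per SKU."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (sku TEXT, quantity INTEGER, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT sku, SUM(quantity), ROUND(SUM(amount), 2) "
        "FROM sales GROUP BY sku ORDER BY sku"
    ).fetchall()
    conn.close()
    return result

print(total_by_sku(SALES))
```

A sentence of free text offers no such fixed set of fields, so there is no obvious row-and-column layout to load it into.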

And then there is text. Text is commonly referred to as unstructured data. And – prior to Forest Rim – text did not fit comfortably and conveniently into a standard database management system.

But is text really unstructured?

The term “unstructured” refers to a lack of structure. And if text were really unstructured, we wouldn’t be able to understand each other when we have a conversation. But we do understand each other when we speak. So what is going on here?

There is definitely structure behind text. There is proper spelling. There is proper punctuation. There is proper sentence construction. There is proper thought development.

Indeed there really is structure behind text, but that structure is quite complex. Language is taught in school from the first grade on. It takes a long time for a human to learn how to speak and also to learn to understand speech. And the deeper you go into language, the more arcane and complex it becomes. Indeed, you can get a doctorate in language and make it your life’s work.

While there really is a structure behind text, does that structure allow the text to be considered to be structured in the eyes of the computer? The answer is no, because even though text is structured, that structure is so vast, so complex and so arcane that the computer cannot understand the structure of language. Therefore, in the eyes of the computer, text is unstructured, even though there really is an underlying structure to text.

The same is true of log tape data. Log data is cryptic. Have you ever looked at a log tape and tried to make sense of it? But there is an order, a structure, to log tape data. That structure is far from obvious. But somewhere there is someone who knows how to read and make sense of the data on a log tape. So a log tape represents unstructured data even though at least one person in the world knows how to make sense of it.
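
The idea that a log record has an order known only to someone who can read it can be sketched in a few lines. The record layout below (timestamp, transaction id, operation code and byte count, separated by "|") is an assumption invented for this example and does not describe any real log format. The pattern stands in for the knowledge held by the person who can read the tape: lines that follow the layout yield named fields, and anything else yields nothing.

```python
import re

# Hypothetical log record layout, assumed purely for illustration:
# 14-digit timestamp | 8 hex digit transaction id | 3-letter op code | byte count
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\d{14})\|(?P<txn_id>[0-9A-F]{8})\|(?P<op>[A-Z]{3})\|(?P<size>\d+)$"
)

def parse_log_line(line):
    """Return the fields of a log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["size"] = int(record["size"])  # the byte count is numeric
    return record

# A line in the assumed layout is split into named fields.
print(parse_log_line("20140306101502|0A1F3C9E|UPD|512"))

# Ordinary text does not fit the layout, so the parser returns None.
print(parse_log_line("The customer called about a late delivery."))
```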

The whole notion of unstructured, then, is relative to the computer that has to read and manage the data. If data is unstructured, it is unstructured in the eyes of the computer reading and managing it. To other eyes, text is anything but unstructured.

  • Bill Inmon

    Bill is universally recognized as the father of the data warehouse. He has more than 36 years of database technology management experience and data warehouse design expertise. He has published more than 40 books and 1,000 articles on data warehousing and data management, and his books have been translated into nine languages. He is known globally for his data warehouse development seminars and has been a keynote speaker for many major computing associations.

