Preparing Text for Analytics Textual Disambiguation: The Wave of the Future

Originally published June 7, 2012

Text is everywhere – in email, in contracts, in “big data.” Everywhere you look you find text.
For years text has been the original persona non grata of technology. Data warehouses are built on transactional data. Entire careers have been built in IT without ever having to address text. But – for a variety of reasons – the day is dawning when text is becoming part of the corporate decision-making landscape.

When you take a look at the full spectrum of data in a corporation, only 20% or less of the data is of the structured variety. It comes as a surprise to many people that transactional data makes up only a small minority of the data in the corporation.

Why hasn’t text been a standard part of corporate decision making if text forms the vast majority of the data in the corporation? There are many challenges with text, but perhaps the most fundamental is that text is non-repetitive. Text is written in a form where there is no structure to its composition. Standard database management systems (DBMSs) simply do not accommodate non-repetitive data; they are built to handle occurrences of data that share the same structure, one occurrence after the other. Trying to stuff text into a standard relational DBMS is like trying to put the proverbial square peg in a round hole.

Textual ETL

Fortunately, now there is textual ETL. With textual ETL you can efficiently and meaningfully place text into a standard relational database management system. Textual ETL accomplishes many functions, and one of the primary ones is to accommodate the needs of a DBMS. But textual ETL does something much more important than that. Before text can be used for analytical processing it MUST be disambiguated – in every case. Raw text cannot be used for analytical processing. Textual ETL disambiguates raw text.

What does it mean to disambiguate text? How in the world do you take raw text and supply the context that is needed in order for the text to be used in analytical processing? This is a complex question with many facets to the answer. But, in general, the steps to textual disambiguation are:
  • Gather and organize all relevant and related sources of text
  • Translate the text into a common and intelligible language
  • Do simple edits on the text
  • Categorize the text using taxonomies and ontologies
  • Organize the text to fit into a standard relational DBMS
Once the raw text has passed through these steps, it is fit for analytical processing.
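
To make the five steps concrete, here is a minimal sketch in Python. It is an illustration only, not how any textual ETL product works: the two documents, the abbreviation table, the stop-word list, the two-category taxonomy, and the text_fact table layout are all invented for the example, and real taxonomies are far larger.

```python
import re
import sqlite3

# Step 1 (gather): hypothetical raw documents keyed by an invented document id.
DOCUMENTS = {
    "email-1": "Customer wants a REFUND for the truck.",
    "contract-7": "Inv due on delivery of the sedan.",
}

# Assumed lookup tables; none of these come from the article.
ABBREVIATIONS = {"inv": "invoice"}
STOP_WORDS = {"a", "the", "of", "on", "for"}
TAXONOMY = {
    "payment": {"invoice", "refund", "deposit"},
    "vehicle": {"car", "truck", "sedan"},
}


def translate(text):
    """Step 2: expand shorthand into a common vocabulary."""
    def swap(match):
        word = match.group(0)
        return ABBREVIATIONS.get(word.lower(), word)
    return re.sub(r"[A-Za-z]+", swap, text)


def simple_edits(text):
    """Step 3: lowercase, split into words, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def categorize(words):
    """Step 4: attach a taxonomy category to every word that has one."""
    for position, word in enumerate(words):
        for category, terms in TAXONOMY.items():
            if word in terms:
                yield position, word, category


def disambiguate(documents):
    """Run steps 2 to 4 and return uniform rows, one per categorized word."""
    rows = []
    for doc_id, raw in documents.items():
        words = simple_edits(translate(raw))
        for position, word, category in categorize(words):
            rows.append((doc_id, position, word, category))
    return rows


def load(rows):
    """Step 5: place the rows in a relational table (in-memory SQLite here)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE text_fact "
        "(doc_id TEXT, position INTEGER, word TEXT, category TEXT)"
    )
    conn.executemany("INSERT INTO text_fact VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


ROWS = disambiguate(DOCUMENTS)
CONN = load(ROWS)
CATEGORY_COUNTS = CONN.execute(
    "SELECT category, COUNT(*) FROM text_fact "
    "GROUP BY category ORDER BY category"
).fetchall()
```

The point of the sketch is the shape of the output: every row has the same four columns, so text that started out non-repetitive can be queried with ordinary SQL, as the final GROUP BY does.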

It should be noted that some of these steps are very complex unto themselves. For example, categorization of text can occur in many ways. Some text can be categorized by comparison to a taxonomy. Other text can be categorized only in the context of the logical sub-structuring of the text as it exists within a document. Other text can be categorized only in terms of its physical placement within a document, and some text can be categorized in terms of its relation to commonly occurring words and phrases. Indeed, the creation of textual categorization is a very large subject unto itself.
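
Two of the categorization styles mentioned above can be sketched briefly in Python. This is a rough illustration under stated assumptions, not a description of any particular tool: the sample contract, the rule that a heading is a line ending in a colon, and the function names are all hypothetical, and "relation to commonly occurring words and phrases" is interpreted here simply as "the word that follows a known cue phrase."

```python
import re

# Hypothetical document; headings are assumed to be lines ending in a colon.
CONTRACT = """Parties:
Acme Corp and Bolt Ltd
Terms:
Payment due within 30 days. Late fee of 2% applies.
"""


def categorize_by_section(document):
    """Label each body line with the most recent heading above it
    (categorization by the logical sub-structure of the document)."""
    labelled = []
    section = "UNKNOWN"
    for line in document.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.endswith(":"):
            section = line[:-1].upper()
        else:
            labelled.append((section, line))
    return labelled


def categorize_by_cue(text, cue, category):
    """Tag the word that immediately follows a cue phrase
    (categorization by relation to a commonly occurring phrase)."""
    pattern = re.compile(re.escape(cue) + r"\s+(\S+)", re.IGNORECASE)
    return [(category, m.group(1).rstrip(".,;")) for m in pattern.finditer(text)]


SECTIONS = categorize_by_section(CONTRACT)
WINDOWS = categorize_by_cue(CONTRACT, "due within", "payment_window_days")
```

In this sketch the same sentence is categorized twice in different ways: once as belonging to the TERMS section, and once as the source of a payment window of 30. That is one reason categorization is a large subject in its own right.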

In any case, before it is safe – even before it makes sense – to use text for decision making, it is necessary to disambiguate the text.

Textual disambiguation is a new concept to most people. Perhaps some people are too set in their ways to learn new techniques. There is plenty of existing data to keep those people happy. But for young, energetic, curious people, there is a brave new world – a world where many of the limitations of the past no longer exist.
  • Bill Inmon

    Bill is universally recognized as the father of the data warehouse. He has more than 36 years of database technology management experience and data warehouse design expertise. He has published more than 40 books and 1,000 articles on data warehousing and data management, and his books have been translated into nine languages. He is known globally for his data warehouse development seminars and has been a keynote speaker for many major computing associations.


Comments

Posted July 15, 2012 by Anonymous

Hi Bill,

What do you think are some lessons learned from database management for transactional data that can be brought to bear on text and unstructured data? What should we keep in mind if we are thinking about best practices for the development life cycle? This is a rather large question obviously, so any thoughts you can share on the topic would be illuminating.

Many thanks,

Leza Zaman