Text Exploration: Analyzing Text without the Hassle

Originally published August 7, 2013

Are We Listening?

Most organizations know how to turn their structured data into valuable insights. Countless reporting and analytical tools are available to assist them. But what about the textual data that has been gathered in emails, document management systems, call center log files, chat or instant messaging transcripts, and voice transcripts from customer calls? And what about external textual data, such as blogs, tweets and Facebook messages? Most organizations have barely scratched the surface with respect to analyzing textual data. This is a missed opportunity.

In their book Mining the Talk [1], Spangler and Kreulen put it as follows:
“People are talking about your business every day.
Are you listening?
Your customers are talking.
They’re talking about you to your face and behind your back.
They’re saying how much they like you, and how much they hate you.
They’re describing what they wish you would do for them, and what the competition is already doing for them.
They are writing emails to you, posting blogs about you, and discussing you endlessly in public forums.
Are you listening?”

Unfortunately, many organizations are not analyzing all the available text; they’re not listening.

This article describes the latest technology for analyzing textual data, so-called text exploration technology. Text exploration enriches the palette of technologies already deployed in business intelligence environments – it allows organizations to really listen to customers. What’s special about this category of products is that no need exists to develop ontologies or thesauri in advance. Right after the software has been switched on, text can be analyzed, even if the text comes from a new domain.

Note: This article contains extracts from the white paper Extending Business Intelligence with Text Exploration Technology [2].

Should We Be Listening?

Organizations can benefit from analyzing textual data. For example, an insurance company may want to analyze all the contracts (textual documents) to find out how many of them expire within one year. A hospital may be interested in analyzing the descriptions written by specialists and included in patient files to discover patterns with respect to allergic reactions to medications. An electronics company may want to analyze messages on Twitter to find out if their products are mentioned positively or not – sentiment analysis. Transcripts of call center log files can be analyzed to determine whether there are popular questions, or whether specific products have been mentioned more than usual during the last couple of weeks.

How Do We Listen?

Most of the traditional text analysis techniques require work in advance. For example, thesauri and ontologies must be developed before the analysis can start. Such tools are very useful, but they can only be used if there is time to do all this work. What if a new and urgent question arises and this hasn’t been catered for in the thesaurus? Or what if new texts have to be analyzed right away?

Tools for analyzing text usually try to identify important concepts in sentences. For example, in the sentence “The enterprise search market is being reshaped by new consumer experiences,” the key concepts are enterprise search market and new consumer experiences. Most text analysis tools try to locate these concepts by looking at individual words, which results in the words consumer, enterprise, experience, market and search. They are considered key concepts in this text.

Some tools search for two-word and even three-word phrases. The risk of this approach, however, is that words are incorrectly “connected.” Take the following sentence as an example: "Michael Phelps breaks a world record." If two-word phrases are identified, the result includes the concepts Michael Phelps and Phelps breaks. The first is probably useful, but the second isn’t. And if we search for all two-word phrases in the earlier sentence about the enterprise search market, we get enterprise search and search market, but not enterprise search market. This classic approach doesn’t guarantee that the linked words form the right concept.
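The weakness of fixed-length phrases is easy to reproduce. The following is a minimal Python sketch, not taken from any particular product: the helper names tokenize and ngrams are made up for this illustration, and the tokenizer simply lowercases the text and keeps alphabetic words.

```python
import re

def tokenize(sentence):
    """Lowercase a sentence and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", sentence.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a space-joined phrase."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

phelps = tokenize("Michael Phelps breaks a world record.")
market = tokenize(
    "The enterprise search market is being reshaped by new consumer experiences."
)

phelps_bigrams = ngrams(phelps, 2)
market_bigrams = ngrams(market, 2)

# Useful and useless pairs come out side by side.
print(phelps_bigrams)

# The three-word concept is cut into two overlapping fragments.
print("enterprise search" in market_bigrams)
print("search market" in market_bigrams)
print("enterprise search market" in market_bigrams)
```

Moving to three-word phrases does recover enterprise search market, but it also produces fragments such as search market is, so the phrase length alone never tells the tool where a concept starts and ends.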

In addition, to make sense of the sentences, developers have to build up thesauri and ontologies. This can be quite a lot of work and requires domain knowledge. For every domain a new thesaurus and ontology must be set up. In most situations, this work never ends because new terms are introduced continuously and the meaning of words can change. Take tweets as an example: new important hashtags are introduced every day.

Furthermore, with most text analysis techniques the goal of the analysis exercise must be clear in advance. In other words, the tool is guided by the analyst. For example, search technology requires that one or more words are entered first. Another example is when patient files are analyzed to discover new insights with respect to the effect a particular medication has on patients with diabetes. As can be imagined, even when the same patient files are analyzed, a different thesaurus may be needed when the goal is to look for historical patterns in side effects after surgery. A thesaurus limits the analytical freedom and thus limits potential outcomes.

Listening with Text Exploration Technology

Many situations exist in which there is no time for all this preparation work. Here, technology is needed that allows text to be analyzed without having to develop a thesaurus, an ontology and a list of synonyms. This form of text analysis is called text exploration.

A hospital environment is a good example of where text exploration can be used. Imagine a patient is brought to the emergency room. If doctors must act quickly, they probably don’t have time to read the full patient file. What they need is a summary that shows all the important aspects related to the patient. Is he diabetic? Does he usually have high blood pressure? What kind of medication is he taking? Has he been here before? To answer such questions, all the text in the patient file must be analyzed on the spot. The analysis should also be unguided because the doctors may know nothing about this patient.

Another example is analyzing tweets. Every day new words (abbreviations in many cases) and hashtags are invented. Constantly updating a thesaurus to keep up would be impractical. And is there even time to develop one?

To summarize, text exploration is a form of text analysis that meets the following three requirements:

No advance preparations: There should be no need to develop thesauri or ontologies before the analysis work can be started. It should be possible to start text analysis straightaway without any preparations, even if this is a new text covering a new domain.

Unguided analysis: Analysts should be able to invoke the text analysis technology without having to specify a goal in advance. The text analysis technology must be able to analyze the text in an unguided style.

Self-service: Analysts must be able to invoke the text analysis functionality without help from IT experts, although connecting the tool to particular data sources may require some assistance.

Listening with the Text Exploration Technology of iKnow

This section describes the text exploration technology called iKnow offered by InterSystems Corporation. iKnow supports the three key requirements for text exploration: no advance preparations, unguided analysis and self-service. iKnow neither requires nor supports the development of thesauri and ontologies. It can analyze text coming from a domain or industry it has never analyzed before and is still able to discover the important concepts.

The text analysis approach taken by iKnow is different from many other approaches. It breaks texts into sentences, and sentences into concepts and relations. A sentence is broken up by first trying to identify the relations in it. Verbs can represent relations between concepts in sentences, but other language constructs can signify relations as well.

By identifying the relations first, iKnow has a better chance of discovering the desired concepts. For example, in the sentence "The programmer found bugs," iKnow considers the verb found to be a relation between the concepts programmer and bugs. In iKnow this is called a concept-relation-concept sequence (CRC). Note that iKnow automatically discards all the stop words from sentences, such as the, an and he.

As indicated, other language constructs can indicate a relation. For example, in the sentence snippet “Mammals, such as elephants and lions …” a relation exists between mammals and elephants and one between mammals and lions. Another example is the sentence “I like the car in the showroom.” Here, the word in represents a relation between the concepts car and showroom. iKnow has been designed to recognize many different language constructs that can identify relations.

If the concepts and relations consist of multiple words, iKnow still recognizes them. For example, in the sentence “The enterprise search market is being reshaped by new consumer experiences,” iKnow discovers that the verb clause is being reshaped by represents the relation between the two concepts enterprise search market and new consumer experiences. Together, these two concepts form a concept-concept pair (CC).
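The relation-first idea can be illustrated with a toy Python sketch. This is emphatically not iKnow's algorithm, which the article does not describe in detail: the word lists RELATION_WORDS and STOP_WORDS are tiny, hand-picked assumptions that only cover the two example sentences, and the function names are invented for this illustration. The point is merely to show how marking the relation words first makes the multi-word concepts fall out as the runs of words in between.

```python
import re

# Hypothetical, hand-picked word lists covering only the demo sentences.
RELATION_WORDS = {"found", "is", "being", "reshaped", "by", "in"}
STOP_WORDS = {"the", "a", "an", "he"}

def split_concepts_relations(sentence):
    """Split a sentence into alternating ("concept" | "relation", text) chunks."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    chunks = []
    for token in tokens:
        if token in STOP_WORDS:
            continue  # stop words are discarded
        kind = "relation" if token in RELATION_WORDS else "concept"
        if chunks and chunks[-1][0] == kind:
            # Extend the current run so multi-word chunks stay together.
            chunks[-1] = (kind, chunks[-1][1] + " " + token)
        else:
            chunks.append((kind, token))
    return chunks

def crc_sequences(chunks):
    """Return (concept, relation, concept) triples from the chunk list."""
    triples = []
    for i in range(len(chunks) - 2):
        (k1, left), (k2, rel), (k3, right) = chunks[i:i + 3]
        if (k1, k2, k3) == ("concept", "relation", "concept"):
            triples.append((left, rel, right))
    return triples

for text in (
    "The programmer found bugs.",
    "The enterprise search market is being reshaped by new consumer experiences.",
):
    print(crc_sequences(split_concepts_relations(text)))
```

Unlike the two-word phrase approach, nothing here fixes the length of a concept in advance: enterprise search market stays in one piece because no relation word interrupts it.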

To summarize, because of this approach to analyzing texts, no need exists to develop an ontology or thesaurus beforehand. Without them, iKnow can still analyze texts and return various measures that can give valuable insights, such as the frequency of a concept in a text, the dominance of a concept and the proximity or the “semantic distance” between two concepts in a text.
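To give a feel for such measures, here is a small Python sketch that computes a concept frequency and a simple proximity score. The definitions are assumptions made for this illustration, not iKnow's formulas: frequency is a plain occurrence count, and proximity adds 1/distance for every time two concepts appear in the same sentence, where distance is the number of concept positions between them. Dominance is left out because the article does not define it. The input data is made up.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: concepts already extracted, in order, per sentence.
sentences = [
    ["programmer", "bugs"],
    ["programmer", "release", "bugs"],
    ["customer", "bugs"],
]

def concept_frequency(sentences):
    """Count how often each concept occurs across all sentences."""
    return Counter(concept for s in sentences for concept in s)

def proximity(sentences):
    """Score concept pairs: closer co-occurrence in a sentence scores higher."""
    scores = Counter()
    for concepts in sentences:
        for (i, a), (j, b) in combinations(enumerate(concepts), 2):
            if a != b:
                # frozenset makes the pair order-independent.
                scores[frozenset((a, b))] += 1 / (j - i)
    return scores

freq = concept_frequency(sentences)
prox = proximity(sentences)

print(freq.most_common())
print(prox[frozenset(("programmer", "bugs"))])
```

Even this crude version needs no thesaurus: both measures are derived purely from where the concepts occur in the text.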

Summary

A wealth of information is hidden in the vast amounts of textual data being created every day. The challenge for every organization is to extract valuable business insights from this mountain of textual data. However, current text analysis techniques require an investment upfront in developing thesauri and ontologies. The time to do this doesn’t always exist. This is where text exploration technology comes in. Almost every industry can benefit from deploying text exploration, especially those industries where storing text is crucial for business operations, such as advertising, healthcare, legal, pharmaceuticals, publishing and real estate.

This article described iKnow as a text exploration technology. Unfortunately, this is the only product of its kind. Hopefully, the market for text exploration tools will grow quickly, because the need is high.

Note: For more information on text exploration and iKnow, see the white paper mentioned at the beginning of this article.

End Notes:
  1. S. Spangler and J. Kreulen, Mining the Talk: Unlocking the Business Value in Unstructured Information, IBM Press, 2008.
  2. R.F. van der Lans, Extending Business Intelligence with Text Exploration Technology, June 2013.

  • Rick van der Lans

    Rick is an independent consultant, speaker and author, specializing in data warehousing, business intelligence, database technology and data virtualization. He is managing director and founder of R20/Consultancy. An internationally acclaimed speaker who has lectured worldwide for the last 25 years, he is the chairman of the successful European Enterprise Data and Business Intelligence Conference, held annually in London. In the summer of 2012 he published his new book Data Virtualization for Business Intelligence Systems. He is also the author of one of the most successful books on SQL, the popular Introduction to SQL, which is available in English, Chinese, Dutch, Italian and German. He has written many white papers for various software vendors. Rick can be contacted by sending an email to rick@r20.nl.

    Editor's Note: Rick's blog and more articles can be accessed through his BeyeNETWORK Expert Channel.
