Originally published August 26, 2008
This article concludes my series on an introduction to text analytics, based on a workshop, Text Analytics for Dummies, which I presented at this year's Text Analytics Summit. It's designed to provide solid grounding in the technology and typical applications and should be especially useful as background for my BeyeRESEARCH report, Voice of the Customer: Text Analytics for the Responsive Enterprise.
Last month, I covered the glossary slides, through slide 33. The story to that point is captured in the statement that “text analytics looks for structure that is inherent in the textual source materials.” And the technical essence is that text analytics typically applies a pipeline (a succession) of basic steps:
If that summary is rough going for you, refer to the glossary toward the end of last month’s article. Continue on now for concrete examples of how statistics and linguistics are used by software to uncover meaning in text.
Retrieving documents for analysis may involve web search or spidering, it may mean reaching into an email or content management system, it may involve subscribing to a feed via RSS or to a service, and it may mean using other means of identifying and selecting a set of interesting materials. The essential point is that no analysis can or should expect to access all of the uncountably large set of web and enterprise sources, most of which will constitute noise if you have well-formulated business goals. It’s an axiom that the scope and nature of information intake should follow from business needs.
Lexical analysis is essentially basic reduction and statistical analysis of text and the words and multi-word terms it contains. By reduction, I mean use of text elements such as punctuation to decide what the words and terms are, and also application of morphological rules and transformations that look at word forms – for instance, stemming words to recognize that “analysis,” “analytics” and “analyze” have the same root and therefore, we infer, similar meaning.
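To make the reduction step concrete, here is a minimal Python sketch of tokenizing on punctuation and then stemming. The suffix list is a toy I made up so that the three example words reduce to one root; it is not the rule set of any production stemmer.

```python
import re

# Hypothetical, deliberately tiny suffix list chosen for this example only.
# Real stemmers apply far richer, ordered rule sets.
SUFFIXES = ("tics", "sis", "ze", "s")

def tokenize(text):
    """Lowercase the text and split it into word tokens at punctuation and spaces."""
    return re.findall(r"[a-z]+", text.lower())

def stem(word):
    """Strip the first matching suffix, keeping at least three leading letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

tokens = tokenize("Analysis, analytics and analyze.")
stems = [stem(token) for token in tokens]
print(list(zip(tokens, stems)))
```

With these toy rules, all three example words reduce to the stem "analy" while "and" is left alone. A rule list this short will mangle plenty of other words, which is why real tools rely on carefully tested morphological rules.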
Let’s put morphology aside and look at basic statistical counting applied to text. Let’s learn by doing, using a website, Ranks.nl, that was designed for search engine optimization (SEO). The theory is that if certain key terms – individual words or multi-word n-grams – make up a high proportion of your page’s text, and if your page is part of a network of links to topically similar pages, then it will come up as a top response to searches on those terms. The idea is to use tools such as the keyword density and prominence analyzer at http://ranks.nl/tools/spider.html to design your page for findability, but let’s go to that page for another purpose – to see how basic lexical analysis can help a computer understand text.
At that page, enter the URL http://altaplana.com/SentimentAnalysis.html – an earlier BeyeNETWORK.com article of mine – or the URL for any not-too-large page you wish. Scroll down the results page to the “Single word repeats” section. For my article, three (sentiment, text, analytics) of the top six words relate to the article’s topic, but the other three (for, that, from) in the top six are pretty useless. So, instead, scroll down to “Total 3 word phrases.” You’ll see top terms “customer experience management,” “enterprise feedback management” and “of text analytics.” Those tri-grams actually capture the “whatness” – the subject matter – of my article pretty well.
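If you would rather see the counting done in code, the following Python sketch tallies single words and three-word phrases in the same spirit as the keyword density tool. The sample text is invented for illustration, and the function names are my own, not anything taken from Ranks.nl.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a space-joined phrase."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_terms(text, n, k):
    """Return the k most frequent n-word terms with their counts."""
    return Counter(ngrams(tokenize(text), n)).most_common(k)

# Invented sample text, standing in for a fetched web page.
sample = (
    "Text analytics supports customer experience management. "
    "Customer experience management depends on the voice of the customer, "
    "and text analytics turns that voice into data."
)

print(top_terms(sample, 1, 3))  # most frequent single words
print(top_terms(sample, 3, 3))  # most frequent tri-grams
```

Note that this sketch ignores sentence boundaries, so some tri-grams span two sentences. It also keeps words like "the" and "of," which is exactly why single-word counts tend to be noisier than phrase counts.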
Basic statistics gets a computer pretty far toward understanding text. Stats – co-occurrence of terms, term proximity, document similarity, statistical pattern detection and matching – can take you a lot further, but for the purposes of this introductory article, we’ll move on to demonstrate how linguistics can help.
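Document similarity, one of the statistical techniques just listed, can be sketched in a few lines of Python. I'm assuming the common cosine measure over raw word counts here; commercial tools typically add term weighting and other refinements. The three short "documents" are made up.

```python
import math
import re
from collections import Counter

def term_counts(text):
    """Count lowercase word tokens in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 to 1.0)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Invented examples: the first two share a topic, the third does not.
doc1 = term_counts("sentiment analysis of customer feedback")
doc2 = term_counts("customer feedback and sentiment")
doc3 = term_counts("quarterly sales by region")

print(cosine_similarity(doc1, doc2))
print(cosine_similarity(doc1, doc3))
```

The first pair scores well above zero because the documents share three words; the second pair scores zero because they share none.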
Syntactic analysis derives understanding from sequences of language elements, from words and punctuation. Essentially, it maps natural language into a set of grammatical patterns to determine useful stuff like parts of speech. And understanding parts of speech – subject, verb, object along with various attributes and modifiers – is the key to discerning facts and relationships within textual sources, the key to turning text into data.
We’ll again use a free web tool to learn by doing. Let’s use Connexor’s Machinese Phrase Tagger. Enter a sentence – say “We’ll again use a free web tool to learn by doing.” – and click "Apply Tagger." You’ll see that the software recognizes not only nouns and verbs but also the roles within the larger phrases and sentence played by each of the words. Play around a bit, and then go to the syntactic parser at http://www.connexor.eu/technology/machinese/demo/syntax/ and enter the same sentence to graphically diagram the sentence – the kind of task many of us did in elementary school. You don’t need to understand the morphological tags the software creates in order to understand that this automated parsing represents a big step in transforming text into data.
The Connexor site’s Machinese Metadata demo page illustrates an additional text-analytics function, entity extraction, albeit only for persons (in the demo). Try it with the text “the string Breck Baldwin refers to a particular person and is therefore a named entity of type person” and see what you get.
I like to illustrate more sophisticated information extraction (IE) – identifying entities such as persons, companies, stock ticker symbols, geographical locations, etc. – using Gate, an open-source text analytics tool from the University of Sheffield in the United Kingdom. I’ll do that now via a screenshot after one last do-it-yourself exercise. Click the links in this sentence to ask yahoo.com about the ticker symbol IBM and ask google.com for population peru. You will see that the top search engines do recognize certain named entities (in addition to recognizing, for instance, that “map” associated with a geographic area is likely a question rather than a search request).
Named entities are ones that are found in some form of look-up, whether a gazetteer of geographies or some other form of lexicon (dictionary) or a taxonomy. This screenshot shows annotation – mark-up of recognized terms – of my sentiment analysis article using Gate. In addition to named entities such as persons and companies, other entities are recognized via pattern matches. Gate (and similar tools) will annotate other text features such as sentences and text already marked-up with XML or HTML tags.
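The two recognition routes described here, look-up and pattern matching, can be sketched in Python as follows. The gazetteer entries, the two patterns and the `annotate` function are all hypothetical illustrations of mine; this is not Gate's API or its rule language.

```python
import re

# Hypothetical gazetteer: known names mapped to entity types.
GAZETTEER = {
    "Breck Baldwin": "Person",
    "IBM": "Company",
    "University of Sheffield": "Organization",
    "Peru": "Location",
}

# Hypothetical patterns for entities that cannot be listed in advance.
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
PATTERNS = {
    "Date": re.compile(r"\b(?:" + MONTHS + r") \d{1,2}, \d{4}\b"),
    "Email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}

def annotate(text):
    """Return (start, end, type, matched text) tuples, sorted by position."""
    annotations = []
    # Route 1: look-up of known names in the gazetteer.
    for name, entity_type in GAZETTEER.items():
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            annotations.append((match.start(), match.end(), entity_type, match.group()))
    # Route 2: pattern matches.
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            annotations.append((match.start(), match.end(), entity_type, match.group()))
    return sorted(annotations)

text = "Breck Baldwin asked about IBM on August 26, 2008."
for start, end, entity_type, matched in annotate(text):
    print(start, end, entity_type, matched)
```

A name that is missing from the gazetteer and fits no pattern simply goes unrecognized, which is the kind of miss that specialized dictionaries and additional rules are meant to reduce.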
As you can see in this screenshot, accuracy isn’t perfect: Gate missed a person name, a name of South Asian origin. But I ran the system with out-of-the-box settings using a standard pipeline of text-analysis steps: I didn’t augment the out-of-the-box annotation with manual corrections or machine learning. I didn’t bring in specialized dictionaries or bases of linguistic patterns and rules tailored to particular business domains or sources. All these additional elements can boost accuracy.
I’ve knowingly skipped a number of text analysis techniques.
One is clustering, categorization and classification (terms defined in Part 1 of this article) of documents and document-contained entities. You can see those text-analytics functions in action, for instance via a Grokker search of Yahoo! and Wikipedia on the term “text analytics.” The software applies data mining algorithms to cluster the results thematically; that is, it classifies individual web pages into result clusters according to similarity with other cluster pages.
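For a feel of how similarity-based clustering works, here is a small Python sketch. It uses a simple single-pass grouping with Jaccard word overlap, which I chose for brevity; I don't know which algorithms Grokker applies, and the page titles and the 0.3 threshold are invented.

```python
import re

def term_set(text):
    """Return the set of lowercase word tokens in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Share of terms two sets have in common (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def cluster(docs, threshold=0.3):
    """Assign each document to the first cluster whose seed document is
    similar enough; otherwise start a new cluster with it as the seed."""
    clusters = []  # each cluster is a list of indexes into docs
    for index, doc in enumerate(docs):
        terms = term_set(doc)
        for members in clusters:
            seed_terms = term_set(docs[members[0]])
            if jaccard(terms, seed_terms) >= threshold:
                members.append(index)
                break
        else:
            clusters.append([index])
    return clusters

# Invented page titles standing in for search results.
pages = [
    "text analytics software tools",
    "sentiment analysis of customer reviews",
    "open source text analytics tools",
    "customer reviews and sentiment",
]
print(cluster(pages))
```

On these four titles, the pages about tools land in one cluster and the pages about customer sentiment in another. The result depends on document order and on the threshold, which is one reason production systems use more robust algorithms.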
Text summarization and machine translation are additional applications of text analytics. Both involve automated sense-making and text transformation. Where content analysis is concerned, however, our main interest is in information extraction to databases, which allows exploratory analysis of the data content of text using familiar business intelligence (BI) tools and techniques.
Compatible structuring of text-sourced information and database-sourced information makes it possible to do BI on text. (By database-sourced information, I mean structured data that originates in transactional or operational systems and is extracted to an analytical database, a data warehouse, and restructured using ETL tools.) We use taxonomies to organize entities and concepts (features) from text that are analogous to dimensions in a standard BI model. For example, geographic areas are frequently important whether you’re working with textual sources or dimensional models, say for corporate sales.
We can also have discovered classifications (taxonomies) of text features, just as we can use data mining to discover important variables and sets of attributes in conventional datasets. The key to successful integration of text-sourced and database-sourced data is semantic compatibility, that is, using the same metadata to describe data from disparate sources.
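Here is a minimal Python sketch of what that integration can look like once text-sourced facts have been extracted to a database. The table names, columns and every value are made up for illustration. The point is only that a shared geographic dimension, described the same way in both tables, lets an ordinary SQL query combine the two kinds of data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Database-sourced data: sales by region (invented figures).
    CREATE TABLE sales (region TEXT, revenue REAL);
    -- Text-sourced data: one row per extracted mention (invented rows).
    CREATE TABLE text_mentions (doc_id INTEGER, region TEXT, sentiment TEXT);
""")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("Peru", 120.0), ("Chile", 80.0)],
)
conn.executemany(
    "INSERT INTO text_mentions VALUES (?, ?, ?)",
    [(1, "Peru", "negative"), (2, "Peru", "positive"),
     (3, "Peru", "negative"), (4, "Chile", "positive")],
)

# Join on the shared region dimension and count negative mentions per region.
rows = conn.execute("""
    SELECT s.region,
           s.revenue,
           SUM(m.sentiment = 'negative') AS negative_mentions
    FROM sales AS s
    LEFT JOIN text_mentions AS m ON m.region = s.region
    GROUP BY s.region, s.revenue
    ORDER BY s.region
""").fetchall()

for region, revenue, negative_mentions in rows:
    print(region, revenue, negative_mentions)
```

The join works only because both tables spell the region the same way. If the text pipeline emitted "Perú" or a country code instead, the rows would not line up, which is the practical meaning of semantic compatibility.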
For BeyeNETWORK readers, some of the most interesting applications of text analytics will be in the customer experience management domain, in the use of text mining and sentiment analysis for voice of the customer research designed to boost customer satisfaction ratings, identify product and service quality issues, hone marketing tactics and provide competitive intelligence. For in-depth discussion, please consult my recent BeyeRESEARCH study, Voice of the Customer: Text Analytics for the Responsive Enterprise, and come back for my next article, in which I plan to cover open source text analytics.
Recent articles by Seth Grimes