Text Analytics Basics, Part 1

Originally published July 29, 2008

I created a workshop, Text Analytics for Dummies, for presentation before the start of this year’s Text Analytics Summit. Many folks who attend the summit are new to text analytics. The summit sponsors and I figured they could use a solid grounding in the technology and typical applications to help them understand sometimes-intense summit content. Our figuring was right on: I expected 20 workshop attendees but we had over 35. It occurs to me that the same conditions apply for readers of my Business Intelligence Network (BeyeNETWORK.com) text analytics channel. I’ve touched on technology underpinnings in previous articles, but I have never covered them comprehensively, hence this month's article, Text Analytics Basics. This article – the first of two parts – should be especially useful as background for my recently published Business Intelligence Network research report, Voice of the Customer: Text Analytics for the Responsive Enterprise, which is featured on BeyeRESEARCH.com.

I’ve posted my class slides on the web; they may be of some use even though many do not carry explanatory text. All the same, the overall text-analytics story should come through clearly. That story starts with placing the technology in the context of what people do with electronic documents:

  1. Publish, manage and archive.

  2. Index and search.

  3. Categorize and classify according to metadata and contents.

  4. Extract information.

For textual documents, text analytics enhances #2 and enables #3 and #4. Text analytics enriches indexing and search by discerning the concepts and relationships behind search terms and document content, context that boosts relevance. That is, text analytics enables search engines to provide more accurate results (as measured by both precision and recall, defined later) and improved results ranking and presentation. Text analytics – text data mining, actually – provides the technology behind clustering, categorizing and classifying documents and their contents, supporting both interactive exploration of text-sourced information and automated document processing. And information extraction (IE) – pulling important entities, concepts, relationships, facts and opinions from text – is the key to including text-sourced data in business intelligence (BI) and predictive-analytics applications.

Back to the Future for Business Intelligence

Given the huge volume of textual information that they and their stakeholders generate, enterprises now face an imperative to exploit “unstructured” sources to discern and act on opportunity and risk. For business intelligence, it’s Back to the Future. The original conception of BI, dating to a 1958 IBM Journal paper, A Business Intelligence System, defined business as “a collection of activities carried on for whatever purpose, be it science, technology, commerce, industry, law, government, defense, et cetera,” and “the notion of intelligence... as the ability to apprehend the interrelationships of presented facts in such a way as to guide action towards a desired goal.” Notably, the paper's author, Hans Peter Luhn, focused exclusively on documents as an information source – business operations weren’t computerized in 1958 – and also on core knowledge management questions:

  • What is known?

  • Who knows what?

  • Who needs to know?

In a sense, for 45+ years, business intelligence detoured around the estimated 80% of enterprise information locked inaccessibly in textual form. The reason is clear. As Prabhakar Raghavan of Yahoo Research explains, “The bulk of information value is perceived as coming from data in relational tables. The reason is that data that is structured is easy to mine and analyze.” So business intelligence thrived – crunching fielded, numerical, RDBMS-managed data, structured for analyses via star schemas and the like. And BI delivered findings via tables, charts and dashboards that focus more on numbers than on knowledge: the “interrelationships of presented facts” that “guide action towards a desired goal.”

But now, within the last few years, text technologies have matured to the point where they can meet the “unstructured data” challenge.

The “Unstructured Data” Challenge

Sources of “unstructured” information that are of enterprise interest include:

  • Email, news and blog articles; forum postings; and other social media.

  • Contact-center notes and transcripts.

  • Surveys, feedback forms and warranty claims.

  • And every kind of corporate document imaginable.

They also include, for particular application domains:

  • Scientific papers.

  • Legal and court filings.

  • Case reports for intelligence, law enforcement and insurance.

These sources may contain “traditional” data. Witness a paragraph such as:

The Dow fell 46.58, or 0.42 percent, to 11,002.14. The Standard & Poor's 500 index fell 1.44, or 0.11 percent, to 1,263.85, and the Nasdaq composite gained 6.84, or 0.32 percent, to 2,162.78.

It’s easy to see how one could structure the content of this paragraph in a relational table. Four columns would do it: 1) Stock market index (primary key), 2) Date, 3) Change, and 4) Closing value. Percent change is a derived value and wouldn’t be stored. We’d have three rows, for “Dow,” “Standard & Poor’s 500,” and “Nasdaq composite.” Search is not up to this job; search can’t turn this or most other text into data.
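
To make that structuring step concrete, here is a minimal sketch using ordinary regular expressions in Python rather than any particular text-analytics product; the pattern and the placeholder date are simplifications for illustration only.

    # A sketch only: plain regular expressions, not a real extraction product,
    # and a placeholder date standing in for the document's metadata.
    import re

    text = ("The Dow fell 46.58, or 0.42 percent, to 11,002.14. "
            "The Standard & Poor's 500 index fell 1.44, or 0.11 percent, to 1,263.85, "
            "and the Nasdaq composite gained 6.84, or 0.32 percent, to 2,162.78.")

    pattern = re.compile(
        r"[Tt]he ([\w&'. ]+?)(?: index| composite)? (fell|gained) "
        r"([\d,]+\.\d+), or [\d.]+ percent, to ([\d,]+\.\d+)")

    rows = []
    for index_name, direction, change, close in pattern.findall(text):
        sign = -1.0 if direction == "fell" else 1.0
        rows.append({
            "index": index_name,              # primary-key column
            "date": "YYYY-MM-DD",             # placeholder; would come from document metadata
            "change": sign * float(change.replace(",", "")),
            "close": float(close.replace(",", "")),
        })

    for row in rows:
        print(row)    # three rows: Dow, Standard & Poor's 500, Nasdaq composite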

Search returns links with brief summaries. That’s fine if you’re looking for a Shakira video or a particular document such as the H.P. Luhn paper I referenced above. But search won’t meet the analytical needs of enterprise competitive- and market-analysis, customer-support, product-management, compliance and other functions.

Analysts, however, need data and not just links. The data that will answer their questions involves:

  • Entities: names, email addresses, phone numbers.

  • Concepts: abstractions of entities.

  • Facts and relationships.

  • Abstract attributes, e.g., “expensive,” “comfortable.”

  • Opinions, sentiments: attitudinal data.

  • ... and sometimes BI objects: for instance, the data “cube” that responds to a query of the form “Who were the top 4 salespeople for each product line, region and quarter for the last two years?”

Nonetheless, search is the Web's killer app, (rightly) perceived as the best answer to the universal findability problem. And, as noted, text analytics can and does enrich search. Witness: If you type “population Peru” or “917+422” into the Google (or live.com or Yahoo!) search box, you get an answer and not just pages of links: the major Web-search engines recognize query patterns, backed by lexicons that identify Peru, for instance, as a country, and deliver what the user is really looking for.
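
A minimal sketch of that query-pattern idea, with a toy lexicon and two hand-written patterns (real Web engines rely on far larger pattern libraries and lexicons than anything shown here):

    # A toy illustration of query-pattern recognition backed by a lexicon.
    import re

    COUNTRIES = {"peru", "france", "japan"}   # toy lexicon identifying country names

    def answer(query):
        # Pattern 1: "population <place>" -- route to a structured lookup if the place is a known country.
        m = re.fullmatch(r"population\s+(\w+)", query, re.IGNORECASE)
        if m and m.group(1).lower() in COUNTRIES:
            return "route to structured lookup: population of " + m.group(1).title()

        # Pattern 2: simple arithmetic such as "917+422" -- compute the answer directly.
        m = re.fullmatch(r"(\d+)\s*([+\-*/])\s*(\d+)", query)
        if m:
            a, op, b = int(m.group(1)), m.group(2), int(m.group(3))
            ops = {"+": a + b, "-": a - b, "*": a * b, "/": a / b if b else None}
            return "answer: " + str(ops[op])

        return "fall back to ordinary keyword search"

    print(answer("population Peru"))   # route to structured lookup: population of Peru
    print(answer("917+422"))           # answer: 1339
    print(answer("Shakira video"))     # fall back to ordinary keyword search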

And sites such as Grokker and Touchgraph dynamically cluster results to assist users in understanding information returned by major search engines. They do this by applying statistical methods that identify prevailing themes in the top-ranked search results and classify results into the discerned clusters. This is text data mining, and data mining techniques such as link analysis and derivation of association and (other) predictive rules may equally be applied to text-sourced information. For this reason, we might characterize text mining – that term is, for all practical purposes, interchangeable with text analytics – as knowledge discovery in text.
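
A rough sketch of that kind of results clustering, using the open-source scikit-learn library and a few invented result snippets (Grokker and Touchgraph use their own methods; this shows only the general statistical technique):

    # Cluster "search results" by statistical similarity of their term vectors.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    results = [
        "Quarterly earnings beat analyst estimates on strong handset sales",
        "Handset maker reports record quarterly revenue and profit",
        "Researchers publish study on battery life in mobile devices",
        "New study examines mobile battery degradation over time",
    ]

    # Represent each result as a TF-IDF term vector, then group results into two clusters.
    vectors = TfidfVectorizer(stop_words="english").fit_transform(results)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

    for label, snippet in sorted(zip(labels, results)):
        print(label, snippet)    # the two financial snippets and the two battery snippets group together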

So text analytics enhances search, a.k.a. information retrieval (IR):

  • It recognizes patterns in search queries to enable basic question answering.

  • It recognizes patterns in search results to enable clustering of results.

But we want to get beyond IR to information extraction (IE), and that’s where text analytics really shines. IE will be the focus of Part 2 of this article, to appear next month. We’ll finish Part 1 with some formal (albeit idiosyncratic, i.e., personal) definitions, created with practitioners rather than theorists in mind. Start with text analytics.

Basic Definitions

Text analytics replicates and automates what researchers, writers, scholars and all the rest of us have been doing with natural-language sources for years. Text analytics:

  • Applies linguistic and/or statistical techniques to extract concepts and patterns that can be applied to categorize and classify documents, audio, video and images.

  • Transforms “unstructured” information into data for application of traditional analysis techniques.

  • Unlocks meaning and relationships in large volumes of information that were previously unprocessable by computer.

Information extraction (IE) involves pulling features out of textual sources. What features might we be looking for? We define:

  • Entity: Typically a name (person, place, organization, etc.) or a patterned composite (phone number, email address); a brief extraction sketch follows this list.

  • Concept: An abstract entity or collection of entities.

  • Fact: A relationship between two entities.

  • Sentiment: A valuation at the entity or higher level.

  • Opinion: A fact that involves a sentiment.
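
As a concrete illustration of the “patterned composite” entities above, here is a minimal Python sketch using simplified regular expressions; the note text and the phone and email patterns are invented for illustration, and real extractors (and names of people or organizations) need far more than a regex.

    # Pull patterned-composite entities out of a short contact-center note.
    import re

    note = ("Customer J. Alvarez called from 202-555-0143 and asked us to reply "
            "to j.alvarez@example.com about the warranty claim.")

    entities = {
        "phone": re.findall(r"\b\d{3}-\d{3}-\d{4}\b", note),
        "email": re.findall(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", note),
    }
    print(entities)
    # {'phone': ['202-555-0143'], 'email': ['j.alvarez@example.com']}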

At a higher level of abstraction, we are looking for:

  • Semantics: A fancy word for meaning, as distinct from syntax, which concerns structure.

To discover content semantics, text analytics applies a variety of natural language processing (NLP) techniques. NLP typically involves a pipeline of steps (sketched briefly in code after this list) that may include:

  • Parsing: Evaluating the contents of a document.

  • Tokenization: Identification of distinct elements within a text.

  • Stemming/Lemmatization: Identifying variants of word bases created by conjugation, declension, case, pluralization, etc.

  • Tagging: Wrapping XML tags around distinct text elements, a.k.a. text augmentation.

  • POS Tagging: Specifically identifying parts of speech.
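
A minimal sketch of a few of these steps using the open-source NLTK toolkit, one convenient option among many; resource names can vary by NLTK version, and real pipelines do considerably more.

    # Tokenize, POS-tag, and lemmatize a sentence with NLTK.
    import nltk
    from nltk.stem import WordNetLemmatizer

    # One-time model downloads: punkt (tokenizer), averaged_perceptron_tagger (POS), wordnet (lemmas).
    for resource in ("punkt", "averaged_perceptron_tagger", "wordnet"):
        nltk.download(resource, quiet=True)

    sentence = "The analysts were reviewing thousands of customer complaints."

    tokens = nltk.word_tokenize(sentence)    # tokenization: split text into distinct elements
    tagged = nltk.pos_tag(tokens)            # POS tagging: label each token with its part of speech
    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(tok.lower(), pos="v" if tag.startswith("VB") else "n")
              for tok, tag in tagged]        # lemmatization: reduce inflected forms to their bases

    print(tagged)    # e.g., ('analysts', 'NNS'), ('reviewing', 'VBG'), ...
    print(lemmas)    # e.g., 'analyst', 'review', 'complaint'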

We apply certain data mining tools and techniques that help make sense of documents and extracted information. They include:

  • Categorization: Specification of ways in which like items can be grouped.

  • Clustering: Creating categories according to statistical criteria.

  • Taxonomy: An exhaustive, hierarchical categorization of entities and concepts, either specified or generated by clustering or created as a hybrid of both top-down (specified) and bottom-up (generated).

  • Classification: Assigning an item to a category, perhaps using a taxonomy.

These techniques may be applied at the document level or at the feature level. For instance, we can cluster news articles by topic or other criteria, but we might also cluster entities – for instance, customer names by purchase dollar volume – in order to make the analysis task more tractable.
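
A brief sketch of document-level classification in that spirit: assign new items to categories learned from a handful of invented, hand-labeled examples, using scikit-learn's multinomial Naive Bayes (one common statistical approach, not the only one).

    # Learn two categories from labeled examples, then classify a new document.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    training_docs = [
        "The battery drains quickly and the charger runs hot",
        "Screen cracked after one week, very disappointed",
        "Delivery was fast and the support agent was helpful",
        "Great value, the setup took only five minutes",
    ]
    training_labels = ["complaint", "complaint", "praise", "praise"]

    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(training_docs, training_labels)

    print(model.predict(["The charger stopped working and support never replied"]))
    # should print ['complaint']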

Lastly, a few measures of how well extraction and retrieval tasks perform; the arithmetic is sketched briefly after this list:

  • Accuracy: How well an IE or IR task has been performed, computed as an F-score that weights precision and recall.1

  • Precision: The proportion of information found that is correct or relevant.

  • Recall: The proportion of available information that is found.
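
The arithmetic behind these measures, sketched with the balanced F-score (often called F1), the most common weighting; other weightings are possible, as the accuracy article cited in the footnote discusses. The extraction results below are a toy example.

    # Compute precision, recall, and the balanced F-score for a toy extraction run.
    def precision_recall_f1(found, relevant):
        found, relevant = set(found), set(relevant)
        correct = found & relevant
        precision = len(correct) / len(found)      # share of what was found that is correct
        recall = len(correct) / len(relevant)      # share of what is available that was found
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, f1

    # An extractor found 4 entities, 3 of them correct, out of 5 actually present.
    found = {"Dow", "Nasdaq", "S&P 500", "Peru"}
    relevant = {"Dow", "Nasdaq", "S&P 500", "Russell 2000", "FTSE 100"}
    print(precision_recall_f1(found, relevant))    # (0.75, 0.6, ~0.667)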

In More Depth...

Stay tuned for Part 2, covering analytics steps, to appear next month. And continue monitoring my Business Intelligence Network channel for text analytics technology and solutions updates.

If you would like an in-depth, in-person introduction, consider attending my class at The Data Warehousing Institute, Text Analytics for BI/DW Practitioners, on August 19, 2008, at the TDWI conference in San Diego.

See you next month at BeyeNETWORK.com and/or in San Diego.

References:

  1. See Text Analytics Accuracy: Requirements and Reality for more detail.

 

Seth Grimes

Seth is a business intelligence and decision systems expert. He is founding chair of the Text Analytics Summit, chair of the Sentiment Analysis Symposium, and principal consultant at Washington, D.C.-based Alta Plana Corporation. Seth consults, writes, and speaks on information-systems strategy, data management and analysis systems, IT industry trends, and emerging analytical technologies.
