Originally published July 29, 2008
I created a workshop, Text Analytics for Dummies, for presentation before the start of this year’s Text Analytics Summit. Many folks who attend the summit are new to text analytics. The summit sponsors and I figured they could use a solid grounding in the technology and typical applications to help them understand sometimes-intense summit content. Our figuring was right on: I expected 20 workshop attendees but we had over 35. It occurs to me that the same conditions apply for readers of my Business Intelligence Network (BeyeNETWORK.com) text analytics channel. I’ve touched on technology underpinnings in previous articles, but I have never covered them comprehensively, hence this month's article, Text Analytics Basics. This article – the first of two parts – should be especially useful as background for my recently published Business Intelligence Network research report, Voice of the Customer: Text Analytics for the Responsive Enterprise, which is featured on BeyeRESEARCH.com.
I’ve posted my class slides on the web; they may be of some use even though many do not carry explanatory text. All the same, the overall text-analytics story should come through clearly. That story starts with placing the technology in terms of what people do with electronic documents:
For textual documents, text analytics enhances #2 and enables #3 and #4. Text analytics enriches indexing and search by discerning, behind search terms and document content, the concepts and relationships that provide relevance-boosting context. That is, text analytics enables search engines to provide more accurate results (as measured by both precision and recall, to be defined later) and improved results ranking and presentation. Text analytics – text data mining, actually – provides the technology behind clustering, categorizing and classifying documents and their contents, supporting both interactive exploration of text-sourced information and automated document processing. And information extraction (IE) – pulling important entities, concepts, relationships, facts and opinions from text – is the key to including text-sourced data in business intelligence (BI) and predictive-analytics applications.
Given the huge volume of textual information generated by enterprises and their stakeholders, organizations now face an imperative to exploit “unstructured” sources to discern and act on opportunity and risk. For business intelligence, it’s Back to the Future. The original conception of BI, dating to a 1958 IBM Journal paper, A Business Intelligence System, defined business as “a collection of activities carried on for whatever purpose, be it science, technology, commerce, industry, law, government, defense, et cetera,” and “the notion of intelligence... as the ability to apprehend the interrelationships of presented facts in such a way as to guide action towards a desired goal.” Notably, the paper's author, Hans Peter Luhn, focused exclusively on documents as an information source – business operations weren’t computerized in 1958 – and also on core knowledge management questions:
In a sense, for 45+ years, business intelligence detoured around the estimated 80% of enterprise information locked inaccessibly in textual form. The reason is clear. As Prabhakar Raghavan of Yahoo Research explains, “The bulk of information value is perceived as coming from data in relational tables. The reason is that data that is structured is easy to mine and analyze.” So business intelligence thrived – crunching fielded, numerical, RDBMS-managed data, structured for analyses via star schemas and the like. And BI delivered findings via tables, charts and dashboards that focus more on numbers than on knowledge, on “interrelationships of presented facts” that “guide action toward a desired goal.”
But now, within the last few years, text technologies have matured to the point where they can meet the “unstructured data” challenge.
Sources of “unstructured” information that are of enterprise interest include:
They also include, for particular application domains:
These sources may contain “traditional” data. Witness a paragraph such as:
The Dow fell 46.58, or 0.42 percent, to 11,002.14. The Standard & Poor's 500 index fell 1.44, or 0.11 percent, to 1,263.85, and the Nasdaq composite gained 6.84, or 0.32 percent, to 2,162.78.
It’s easy to see how one could structure the content of this paragraph in a relational table. Four columns would do it: 1) Stock market index (primary key), 2) Date, 3) Change, and 4) Closing value. Percent change is a derived value and wouldn’t be stored. We’d have three rows, for “Dow,” “Standard & Poor’s 500,” and “Nasdaq composite.” Search is not up to this job; search can’t turn this or most other text into data.
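To make the idea concrete, here is a toy Python sketch that pulls that paragraph's figures into rows. It is only an illustration of the text-to-data principle, not how commercial extraction engines are built; real systems rely on linguistic analysis and broad lexicons rather than a single hand-written pattern.

```python
import re

text = ("The Dow fell 46.58, or 0.42 percent, to 11,002.14. "
        "The Standard & Poor's 500 index fell 1.44, or 0.11 percent, to 1,263.85, "
        "and the Nasdaq composite gained 6.84, or 0.32 percent, to 2,162.78.")

# Pattern: index name, direction verb, point change, closing value.
pattern = re.compile(
    r"(?:The )?(?P<index>Dow|Standard & Poor's 500 index|Nasdaq composite)\s+"
    r"(?P<verb>fell|gained)\s+(?P<change>\d+\.\d+), or \d+\.\d+ percent, "
    r"to (?P<close>\d[\d,]*\.\d+)"
)

rows = []
for m in pattern.finditer(text):
    sign = -1.0 if m.group("verb") == "fell" else 1.0
    rows.append({
        "index": m.group("index"),
        "change": sign * float(m.group("change")),           # signed point change
        "close": float(m.group("close").replace(",", "")),   # strip thousands commas
    })

for row in rows:
    print(row)
```

The percent change, as noted, is derived rather than stored, so the sketch discards it.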
Search returns links with brief summaries. That’s fine if you’re looking for a Shakira video or a particular document such as the H.P. Luhn paper I referenced above. But search won’t meet the analytical needs of enterprise competitive- and market-analysis, customer-support, product-management, compliance and other functions.
Analysts, however, need data and not just links. The data that will answer their questions involves:
Nonetheless, search is the Web's killer app, (rightly) perceived as the best answer to the universal findability problem. And, as noted, text analytics can and does enrich search. Witness: If you type “population Peru” or “917+422” into the Google (or live.com or Yahoo!) search box, you get an answer and not just pages of links: the major Web-search engines recognize patterns, backed by lexicons that identify Peru, for instance, as a country, to deliver what the user is really looking for.
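The pattern-recognition idea behind those answer boxes can be sketched in a few lines of Python. Everything here is a hypothetical stand-in: the tiny country lexicon and the `answer_or_search` function are inventions for illustration, nothing like the scale or sophistication of what the search engines actually deploy.

```python
import re

def answer_or_search(query):
    """Return a direct answer for recognized query patterns,
    or None to signal fallback to ordinary link retrieval."""
    # Arithmetic pattern: two integers joined by an operator.
    m = re.fullmatch(r"\s*(\d+)\s*([+\-*/])\s*(\d+)\s*", query)
    if m:
        a, op, b = int(m.group(1)), m.group(2), int(m.group(3))
        ops = {"+": a + b, "-": a - b, "*": a * b,
               "/": a / b if b else None}
        return ops[op]
    # Lexicon-backed pattern: "population <country>".
    countries = {"peru", "france", "japan"}   # toy stand-in lexicon
    m = re.fullmatch(r"\s*population\s+(\w+)\s*", query, re.IGNORECASE)
    if m and m.group(1).lower() in countries:
        return f"population lookup: {m.group(1)}"
    return None  # no pattern recognized; return ordinary links

print(answer_or_search("917+422"))          # → 1339, an answer, not links
print(answer_or_search("population Peru"))  # recognized via the lexicon
```

The point is the control flow: recognize a pattern, consult a lexicon, and answer directly; only otherwise fall back to ranked links.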
And sites such as Grokker and Touchgraph dynamically cluster results to assist users in understanding information returned by major search engines. They do this by applying statistical methods that identify prevailing themes in the top-ranked search results and classify results into the discerned clusters. This is text data mining, and data mining techniques such as link analysis and derivation of association and (other) predictive rules may equally be applied to text-sourced information. For this reason, we might characterize text mining – a term that is, for all practical purposes, interchangeable with text analytics – as knowledge discovery in text.
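A statistical clustering of result snippets can itself be sketched compactly. The version below is my own toy: TF-IDF term weighting, cosine similarity, and a greedy single-pass grouping, with a threshold chosen to suit the sample data. Grokker, Touchgraph and their kin use considerably richer methods.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Simple TF-IDF vectors for a list of token lists."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [{t: tf[t] * math.log(n / df[t]) for t in tf}
            for tf in (Counter(doc) for doc in docs)]

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cluster(snippets, threshold=0.05):
    """Greedy single-pass clustering: join a snippet to the first
    cluster whose seed it resembles, else start a new cluster."""
    vecs = tfidf_vectors([s.lower().split() for s in snippets])
    clusters = []   # list of (seed_vector, member_indices)
    for i, v in enumerate(vecs):
        for seed, members in clusters:
            if cosine(seed, v) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((v, [i]))
    return [members for _, members in clusters]

snippets = [
    "jaguar car dealer sale price",
    "jaguar car road test review",
    "jaguar habitat rainforest conservation",
    "rainforest wildlife conservation fund",
]
print(cluster(snippets))  # the car snippets group apart from the wildlife ones
```

Note how the ambiguous term "jaguar" alone does not bind the snippets together; the surrounding vocabulary, weighted by TF-IDF, separates the automotive theme from the conservation theme.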
So text analytics enhances search, a.k.a. information retrieval (IR):
But we want to get beyond IR to information extraction (IE), and that’s where text analytics really shines. IE will be the focus of Part 2 of this article, to appear next month. We’ll finish Part 1 with some formal (albeit idiosyncratic, i.e., personal) definitions, created with practitioners rather than theorists in mind. Start with text analytics.
Text analytics replicates and automates what researchers, writers, scholars and all the rest of us have been doing with natural-language sources for years. Text analytics:
Information extraction (IE) involves pulling features out of textual sources. What features might we be looking for? We define:
At a higher level of abstraction, we are looking for:
To discover content semantics, text analytics applies a variety of natural language processing (NLP) techniques. NLP typically involves a pipeline of steps that may include:
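A toy Python sketch of the first few pipeline stages, sentence splitting, tokenization and normalization, may help fix the idea. Production NLP components are far more sophisticated (handling abbreviations, clitics, multiword terms and much more); this is only the shape of a pipeline.

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break on . ! ? followed by whitespace
    and an uppercase letter (real splitters handle abbreviations)."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())

def tokenize(sentence):
    """Split a sentence into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def normalize(tokens):
    """Lowercase tokens and drop bare punctuation."""
    return [t.lower() for t in tokens if t.isalnum()]

def pipeline(text):
    """Run the stages in sequence: one token list per sentence."""
    return [normalize(tokenize(s)) for s in split_sentences(text)]

print(pipeline("Text analytics enriches search. It also enables extraction!"))
```

Each stage consumes the previous stage's output, which is precisely what makes these systems pipelines: later steps such as part-of-speech tagging or entity recognition would slot in the same way.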
We apply certain data mining tools and techniques that help make sense of documents and extracted information. They include:
These techniques may be applied at the document level or at the feature level. For instance, we can cluster news articles by topic or other criteria, but we might also cluster entities – for instance, customer names by purchase dollar volume – in order to make the analysis task more tractable.
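The entity-level case, grouping customer names by purchase dollar volume, can be suggested with a minimal sketch. I use simple equal-width binning here as a stand-in for a real clustering algorithm; the customer names and figures are invented for illustration.

```python
def volume_tiers(customers, n_tiers=3):
    """Bucket (name, dollar_volume) pairs into equal-width tiers
    so downstream analysis can work per segment, not per name."""
    volumes = [v for _, v in customers]
    lo, hi = min(volumes), max(volumes)
    width = (hi - lo) / n_tiers or 1.0   # avoid zero width if all equal
    tiers = [[] for _ in range(n_tiers)]
    for name, v in customers:
        idx = min(int((v - lo) / width), n_tiers - 1)
        tiers[idx].append(name)
    return tiers

customers = [("Acme", 200_000), ("Bolt", 8_500), ("Cord", 95_000),
             ("Dyne", 11_000), ("Eads", 400_000)]
print(volume_tiers(customers))  # low-, mid- and high-volume segments
```

Whatever the grouping method, the payoff is the same: an analyst reasons about three segments rather than thousands of individual names.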
Stay tuned for Part 2, covering analytics steps, to appear next month. And continue monitoring my Business Intelligence Network channel for text analytics technology and solutions updates.
If you would like an in-depth, in-person introduction, consider attending my class at The Data Warehousing Institute, Text Analytics for BI/DW Practitioners, on August 19, 2008, at the TDWI conference in San Diego.
See you next month at BeyeNETWORK.com and/or in San Diego.
SOURCE: Text Analytics Basics, Part 1
Recent articles by Seth Grimes