Originally published October 30, 2007
Text analytics is a new IT discipline that has already proved itself in applications ranging from pharmaceutical drug discovery to counter-terrorism to survey analysis, in science, government, and industry. It is poised to break out into the broader analytics market, in workbench form, integrated with business intelligence solutions, embedded in line-of-business applications, and enabling semantic search.
Text analytics is an answer to the “unstructured data” problem, which is best expressed by the truism that eighty percent of enterprise information originates and is locked in “unstructured” form. That problem has been recognized for decades. In fact, the first definition of business intelligence (BI) itself, in H.P. Luhn’s October 1958 IBM Journal article, “A Business Intelligence System,” describes a system that will:
“…utilize data-processing machines for auto-abstracting and auto-encoding of documents and for creating interest profiles for each of the ‘action points’ in an organization. Both incoming and internally generated documents are automatically abstracted, characterized by a word pattern, and sent automatically to appropriate action points.”
So we see that the earliest BI focus was on text – on extraction, categorization, and classification rather than on numerical data!
Yet as management information systems developed starting in the 1960s, and as BI emerged in the '80s and '90s as a software category and field of practice, the emphasis was on numerical data stored in relational databases. This is not surprising: text in “unstructured” documents is hard to process. We went after the low-hanging fruit – the fielded, numerical data – in response to the analytics imperative that any business process worth conducting should be measurable, and that any data worth collecting should be analyzed.
After two decades of numbers-focused business intelligence, analytical tools and techniques – reporting, OLAP, data mining, ETL and data warehousing – are well understood and have been widely adopted. BI software is now a commodity technology, so lately market attention has turned to a new “old” challenge – text. The difference is that we now have technology that is uniquely capable of extending conventional data analysis solutions to the full breadth of enterprise information assets.
Text analytics as a technology has its roots in linguistics and data mining; but in recent years, it has broken out of the lab into the wider analytics world, first via extensions to data mining workbenches and more recently in the form of term-extraction and analysis interfaces. The ability to discern features in text – for instance, personal and geographic names, dates, telephone numbers and e-mail addresses, as well as concepts and even sentiments – and to extract them to databases is now an important feature of leading ETL tools. We have recently started seeing line-of-business applications that rely on text analytics (for instance, for automated processing of news feeds) that demonstrate the technology’s new maturity.
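The feature extraction described above can be illustrated with a minimal sketch. It assumes simple regular expressions for e-mail addresses, US-style telephone numbers, and ISO-style dates; the patterns, the `extract_features` function, and the row layout are hypothetical choices made for illustration, and commercial ETL tools rely on far richer linguistic and statistical methods, especially for names, concepts, and sentiments.

```python
import re

# Deliberately simple, hypothetical patterns; real extractors handle far more variation.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")  # US-style 555-123-4567 only
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")   # ISO-style 2007-10-30 only

PATTERNS = (("email", EMAIL_RE), ("phone", PHONE_RE), ("date", DATE_RE))


def extract_features(doc_id, text):
    """Return (doc_id, feature_type, value) rows, ready to load into a database table."""
    rows = []
    for feature_type, pattern in PATTERNS:
        for match in pattern.finditer(text):
            rows.append((doc_id, feature_type, match.group()))
    return rows


sample = "On 2007-10-30 contact jane.doe@example.com or call 555-123-4567."
for row in extract_features("doc-1", sample):
    print(row)
```

Each row pairs a document identifier with a typed value, which is the shape that lets text-sourced features sit alongside fielded data in a warehouse.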
The broad market now understands that keyword search and over-structured portals as information access pathways are not sufficient. Concepts and relationships are key, and flexible, intention-aware navigation to information is an achievable next step. Progressive BI shops understand that the possibility of achieving much-talked-about 360° views – the ability to hear the voice of the customer (for customer relationship management) and the voice of the market (for product development, marketing, and competitive intelligence) – relies on tapping the volumes of information previously locked in textual form. As the adoption of real-time and predictive analytics and of Web 2.0 technologies grows, the importance of staying on top of fast-paced news feeds, blogs, and message boards grows correspondingly.
It is important to understand technology basics.
Text analytics first emerged in the late 1990s as “text data mining” or just “text mining.” Early approaches would treat a text source as a “bag of words.” They evolved to use basic, shallow linguistics to handle variant word forms such as abbreviations, plurals, and conjugations as well as multi-word terms known as n-grams. Basic lexical analysis might count frequencies of words and terms in order to carry out elementary functions such as attempting to classify documents by topic. But there was no ability to understand the semantics – the meaning – contained within a given document.
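A minimal sketch of the “bag of words” approach follows. It assumes a crude tokenizer that lowercases text and splits on non-letters, and it counts single words plus two-word n-grams; the function names are hypothetical, and nothing here handles abbreviations, plurals, or conjugations.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-letters; a crude stand-in for real lexical analysis."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def term_frequencies(text, max_n=2):
    """Count words and multi-word terms up to max_n tokens long.

    Word order beyond the n-gram window is discarded, and n-grams may span
    sentence boundaries because punctuation is dropped by the tokenizer.
    """
    tokens = tokenize(text)
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


doc = "Text mining treats text as a bag of words. Text mining counts words."
freqs = term_frequencies(doc)
print(freqs["text"], freqs["text mining"], freqs["words"])  # prints: 3 2 2
```

The counts say how often a term appears but nothing about what it means, which is the limitation the paragraph above describes.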
Data mining and, by extension, text data mining carry the analytical process a step further. Data mining looks for hidden relationships and other, complex patterns within datasets. Techniques include classification, clustering, link analysis, decision trees, and the like. Predictive modeling is used for business functions such as credit-scoring, fraud detection, risk analysis, and forecasting to project trends in time-dependent information. These techniques can all be applied to data derived from textual sources, albeit with adjustments to accommodate, for instance, the high dimensionality of text-derived information if every distinct term has naively been turned into an analytical dimension. Researchers have adapted or developed statistical techniques to deal with these issues, such as singular-value decomposition for dimensionality reduction and support vector machines, which cope well with high-dimensional inputs. Coupled with machine-learning algorithms (neural networks and the like) and deeper linguistics that support functions such as semantic disambiguation (using context to understand whether a reference to “Ford” is about a car or a president or a place to cross a river), text becomes tractable.
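The “Ford” example can be made concrete with a toy sketch of context-based disambiguation. It assumes a hand-picked set of cue words per sense and chooses the sense whose cues overlap most with the surrounding sentence; the `SENSE_CUES` table and `disambiguate` function are hypothetical, and production systems learn such associations statistically rather than from a fixed list.

```python
import re

# Hypothetical cue words for each sense of "Ford"; hand-picked for illustration only.
SENSE_CUES = {
    "car maker": {"car", "truck", "engine", "dealer", "drive", "model"},
    "president": {"president", "gerald", "nixon", "pardon", "resigned"},
    "river crossing": {"river", "stream", "cross", "shallow", "water", "wade"},
}


def disambiguate(sentence):
    """Return the sense sharing the most cue words with the sentence, or None if none match."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {sense: len(words & cues) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(disambiguate("The dealer offered a test drive of the new Ford truck."))
print(disambiguate("Ford became president after Nixon resigned."))
print(disambiguate("We had to ford the river where the water was shallow."))
```

A sentence with no cue words returns `None` rather than a guess, which reflects how little a fixed word list can decide without context.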
The earliest text-mining users were investigators such as intelligence analysts and biomedical researchers looking for needles in haystacks: the terrorist, known or unknown, who might be detected through a pattern of actions and associations; a protein whose presence activates or inhibits a genetic pathway, leading to a cancer, that might be discovered by mining biomedical literature. These are very important applications, and this style of analysis has since been adopted for diverse applications ranging from fraud detection to creation of investment strategies.
Researcher-users, often with advanced degrees in statistics or linguistics or related fields, are well able to handle the complexity inherent in using advanced analytics through workbench interfaces. But recently, a new, different style of text analysis has grown in prominence: the use of familiar business intelligence interfaces and techniques to gain a broad understanding of trends without excessive concern for individual cases – for those needles in haystacks. At the same time, feature-extraction capabilities have moved beyond entities such as names and e-mail addresses to events and sentiments. When software can parse “the service was lousy” and initiate an appropriate action, and when it can start to comprehend the non-literal human meaning of a phrase such as “Yeah, that idea is a real winner,” then we’ve made real progress in achieving H.P. Luhn’s 1958 goals for business intelligence systems. Text analytics is now starting to do just those things. This adaptation and strengthening of text analytics creates new possibilities.
Text analytics becomes another asset in the integrated analytics toolkit. We have the flexibility to take any of several integration approaches, choosing whichever approach best fits the character of the data and our business needs. We can extract features to structured data warehouses, where text-sourced information may be analyzed alongside numerical data generated by operational applications. And rather than working within a data or text mining workbench, an approach that would best suit analytics power users, we can now work with text from within line-of-business applications.
Text analytics continues to evolve rapidly, so we end this brief history with a look ahead. To understand what's next for text analytics, consider the grand challenge for text mining as articulated by Ronen Feldman, a pioneer in the field. Feldman spoke at a 2006 Association for Computing Machinery panel about text mining systems that will be able to pass standard reading comprehension tests such as the SAT, GRE, and GMAT. According to Feldman, such a system must achieve excellent entity recognition and relation extraction, with very high precision (the share of retrieved information that is relevant) and recall (the share of relevant information that is retrieved). Per Feldman, these systems should work in any domain, operating fully autonomously without human intervention, and should analyze huge corpuses (sets of documents) and come up with “truly interesting findings.” There are perhaps other elements: the ability to understand what a linguist would call exophora, or references that are not resolved in examining a single source document; the ability to assess accuracy; and the creation of a mechanism for assessing and weighting the correctness of identified responses in order to formulate a single, best response.
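Precision and recall, as used above, can be computed with a short sketch. It assumes that the extracted entities and the correct (“gold”) entities are available as sets; the sample entity sets are invented for illustration and are not results from any real system.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved items against the relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Precision: what fraction of what we retrieved is actually relevant.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # Recall: what fraction of what is relevant we managed to retrieve.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: entities a system extracted versus the entities it should have found.
extracted = {"IBM", "H.P. Luhn", "Ford", "October"}
gold = {"IBM", "H.P. Luhn", "Ronen Feldman"}

p, r = precision_recall(extracted, gold)
print(f"precision={p:.2f} recall={r:.2f}")  # prints: precision=0.50 recall=0.67
```

The two measures pull against each other: extracting more candidates tends to raise recall while lowering precision, which is why the challenge asks for both to be very high.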
The pace of technical improvement is rapid, in keeping with the very high value that text analytics is delivering to enterprise users. Text analytics solutions have proved capable of addressing diverse challenges, and with business intelligence integration and embedding in line-of-business applications, the technology is poised for broad market adoption.
Stay tuned to my Business Intelligence Network Expert Channel on Text Analytics to learn more.
Recent articles by Seth Grimes