Originally published May 13, 2008
There are many elements that affect text analytics’ accuracy. The most significant trace back to the nature of textual data sources. Text-sourced data differs from the data extracted from transactional and operational databases that is the basis of business intelligence (BI) work; it doesn’t come packaged in fielded form. Instead, sources are “unstructured,” and, further, the vast majority of them – news articles, blog and forum postings, email messages, contact center notes and transcripts, scientific papers, etc. – were designed for human communications rather than for computerized processing. Where the semantics – the meaning – of database tables and fields is implicit in the design of a database, the content of textual documents is almost never similarly organized to support computerized processing. The result is that text analytics has the difficult job of discerning and exploiting linguistic and statistical patterns in order to infer the semantics latent in textual sources. Only via these steps can we generate structure that makes text-sourced information amenable to analysis.
The business intelligence assumption is often that data quality must be 100%, that there can be no allowance for bad data. Yet this is an incorrect assumption, and the same is true for text-analytics accuracy: 100% accuracy is rarely an absolute requirement, and it is rarely attainable in any case.
For both business intelligence and text analytics, data quality/accuracy is good enough when it delivers results with a high enough degree of confidence. Conventional data analysis techniques allow for less than complete accuracy – in data collection, in data analysis, in interpretation – and the same latitude should apply for text analytics, for work with text-sourced data.
Consider typical text analytics applications. They involve either some facet of classifying, querying, summarizing, translating, indexing and otherwise making sense of natural-language documents, or extraction and analysis of the entities, concepts, facts, relationships and sentiment found in textual sources. If I can infer the topic of an email message to my company – say, that a customer is dissatisfied with a product she purchased – and my purpose is to route messages of this type to a customer support representative for action, I don’t have to worry about extracting detailed information from the message. The accuracy need is relative to the use to which analytical results will be put.
The process of designing for accuracy must therefore start with outputs, with an understanding of the decisions to be made and the findings that must be produced to support them. We will consider this process, and common sources of inaccuracy, but first let’s see how accuracy is measured for text technologies and what its components are. Then we can look at how to design for the appropriate level of accuracy in our text-analytics applications.
The accuracy of information retrieval (for instance, the results returned by a search) and of information extraction (where important entities, concepts and facts are pulled from “unstructured” sources) is typically measured by an f-score, a value based on two factors – precision and recall.
Precision is the proportion of information found that is correct or relevant. For example, if a Web search on “John Lennon” turns up 17 documents on Lennon and also 3 exclusively about Yoko Ono, who is of little interest but was associated with Lennon due to co-occurrence of the two individuals’ names in a large number of documents, then the precision proportion would be 17/20 or 85%.
Recall, by contrast, is the proportion of the relevant information available that is actually found. If there were actually 8 documents legitimately about John Lennon that were not found, perhaps because only a small portion of each was devoted to Lennon, leading to low “term density,” then the recall would be 17/25 or 68%: 17 documents found out of the 17 + 8 = 25 relevant documents available.
If we insist on absolute precision, we’re likely to miss much relevant information (that is, to have many false negatives). If we insist on high recall – the trivial approach would be to bring back all available information with no discernment – then we risk diluting precision. So we incorporate both measures in the f-score, computed as the weighted harmonic mean of precision and recall. If we weight the two factors equally, we have the equation:
f = 2*(precision * recall) / (precision + recall)
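As a rough illustration, the John Lennon search example can be worked through in a few lines of Python. This is only a sketch: the function name precision_recall_f1 and its arguments are made up for this example, and the counts (17 relevant documents found, 3 irrelevant documents found, 8 relevant documents missed) come from the example above.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, f) using the equally weighted f-score.

    Hypothetical helper for illustration only; the names are assumptions.
    """
    found = true_positives + false_positives        # everything the search returned
    relevant = true_positives + false_negatives     # everything it should have returned
    precision = true_positives / found if found else 0.0
    recall = true_positives / relevant if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f_score


# Counts from the example: 17 documents on Lennon found,
# 3 found that were only about Yoko Ono, 8 Lennon documents missed.
precision, recall, f_score = precision_recall_f1(17, 3, 8)
print(f"precision={precision:.2f} recall={recall:.2f} f={f_score:.3f}")
```

With these counts the f-score falls between the two inputs, and sits closer to the lower one (recall), which is the usual behavior of a harmonic mean.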
How you weight precision and recall in computing an accuracy score depends on the nature of your application. If you’re looking for “needles in haystacks,” you need high recall levels and can perhaps tolerate lower precision, a larger proportion of false positives. Or, you may be interested in a broad, statistical characterization, in which case you might want higher precision at the cost of missing some data. What’s important is to match the weighting to business goals.
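One common way to express such a weighting is the F-beta form of the weighted harmonic mean. It is shown here as a sketch under that assumption, since the article does not name a specific weighting scheme. The function name f_beta and the beta values are illustrative choices: beta greater than 1 favors recall, beta less than 1 favors precision, and beta equal to 1 gives the equally weighted formula above.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F-beta form).

    Assumption: this is one standard weighting scheme, not necessarily
    the one any particular text-analytics product uses.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    denominator = (beta ** 2) * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


# Precision and recall from the John Lennon search example.
precision, recall = 0.85, 0.68

for label, beta in [("recall-heavy (needles in haystacks)", 2.0),
                    ("equal weighting", 1.0),
                    ("precision-heavy (broad characterization)", 0.5)]:
    print(f"{label}: beta={beta} score={f_beta(precision, recall, beta):.3f}")
```

Because recall is the weaker of the two measures in this example, the recall-heavy score comes out lowest and the precision-heavy score highest. The same search looks better or worse depending on which business goal the weighting reflects.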
Other measures may apply to text-analytics problems other than information retrieval and information extraction. For instance, we might wish to understand how similar two documents are or how well a document fits the category in which it has been classified. In these cases, we might use a vector-space model, where (distinctive) words and terms form the dimensions of a mathematical space. Each document is represented by a tuple of word/term weights, which is a vector in the overall document space. We gauge similarity, or closeness of fit, according to the angle between two such vectors, either from two documents or from a document and the centroid of a cluster.
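A minimal sketch of the vector-space idea follows, assuming raw word counts as the term weights and the cosine of the angle as the similarity measure. Real systems typically use more refined weights. The sample documents and the helper names term_vector, cosine_similarity and centroid are invented for illustration.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a document as word counts (a very simple term weighting)."""
    return Counter(text.lower().split())


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two term vectors: 1.0 means same direction."""
    dot = sum(vec_a[term] * vec_b[term] for term in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Sum the vectors of a cluster. The cosine measure ignores vector length,
    so the sum points in the same direction as the mean."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return total


# Hypothetical documents, made up for this sketch.
doc_a = term_vector("dell laptop ram upgrade overpriced")
doc_b = term_vector("ram upgrade for dell laptop costs double")
doc_c = term_vector("hotel room was clean and quiet")

hardware_cluster = centroid([doc_a, doc_b])
print(cosine_similarity(doc_a, doc_b))             # shares four terms
print(cosine_similarity(doc_a, doc_c))             # shares no terms
print(cosine_similarity(doc_c, hardware_cluster))  # poor fit to the cluster
```

A smaller angle (cosine closer to 1) indicates greater similarity or a better fit to the category. Documents with no terms in common score 0.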
These, however, are technical measures. Certainly they should be used, but for real-world problems, they are perhaps not enough. The most important measure of accuracy is one that assesses whether a solution adequately responds to the need. For instance, software might correctly identify a sentiment such as anger or frustration in a forum posting without determining what incident or other cause motivated that sentiment. Findings may be technically accurate but still incomplete and insufficient.
I wrote that the job of discerning and exploiting linguistic and statistical patterns in text is difficult. The nature and degree of difficulty varies widely depending on the type of text you’re working with. Material from quick and/or casual conversations is perhaps the most difficult. Go back to the Dell IdeaStorm.com example that I cited in my January article, “Dell really... REALLY need to stop overcharging... and when i say overcharing... i mean atleast double what you would pay to pick up the ram yourself.” Or, consider a bit of classic instant messaging text, “RU There PPL?” Even if our software can get past the abbreviations and irregular grammar, it still has to deal with the question of whom “I” and “you” and “people” refer to. (These are exophora: their referents are explained outside the text snippets.) That is, text analytics may be able to grasp the syntax without being able to get the semantics (the meaning).
We do have resources available. For instance, a lexicon that includes named entities such as “Dell” could help with the aforementioned problem, as could a taxonomy that classifies “RAM” as a hardware component. Similar resources are available for dealing with other sources of text such as:
Configuration steps can make a difference. Set up good “stop lists” of words and terms to ignore in text analyses. These may be common words such as “and” and “the,” but they may also include less common terms that, if not put aside, would distort results.
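A stop list can be as simple as a set of terms filtered out before counting. The sketch below assumes a small hand-built list. The specific entries, including the domain-specific ones, are hypothetical examples rather than a recommended list.

```python
import re

# Common words that carry little topical meaning (illustrative, not exhaustive).
COMMON_STOP_WORDS = {"and", "the", "a", "an", "of", "to", "is"}

# Hypothetical domain-specific additions: terms so frequent in this source
# that they would otherwise dominate and distort the counts.
DOMAIN_STOP_TERMS = {"dell", "customer"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_terms(tokens, stop_terms):
    """Drop every token that appears in the stop list."""
    return [token for token in tokens if token not in stop_terms]


tokens = tokenize("The customer says the RAM and the battery of the Dell laptop failed")
print(remove_stop_terms(tokens, COMMON_STOP_WORDS | DOMAIN_STOP_TERMS))
```

Whether a term such as a company name belongs on the list depends on the analysis. It may be noise in the company's own support forum yet a key signal in a general review site.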
These resources and techniques won’t help with framing problems, that is, if the choice of source material is inadequate for the problem at hand. For instance, if you’re interested in customer experience, a website survey won’t reach customers who shop by phone or at a store location. Analysis of contact center transcripts and notes will provide useful information, but mostly from customers who are having service or product problems. So do consider the question: What do you want to learn?
An excellent way to boost accuracy of the overall solution is to apply a coordinated variety of methods, analyzing and correlating operational materials such as contact center records, active feedback from surveys, passive sources such as blog and forum postings, transactional information such as web clickstreams and purchases, and other forms of customer interaction. Accuracy (in a broad sense) can also be improved via data enrichment, for example by linking demographic statistics to individual records in order to form a more complete picture of customers. That is, it is often not enough to focus exclusively on algorithm performance. The picture is much more complex.
Design your text analytics solution to deliver a level of accuracy that will help you achieve your business goals.
Tracing back from those goals, determine what information (data, facts, rules, etc.) you need to produce. Figure out what analytical processes can create the required information and what inputs and sources you’ll need to feed those processes. All of this is standard analytical methodology, but it applies especially where text is involved because textual sources are so varied in origin and form.
Secure access to the sources and, in the case of surveys and reporting and warranty forms and the like (these are essentially data-collection instruments), design them well. For example, if you’re drafting a survey or a feedback form, focus the questions and response fields on a narrowly defined topic or issue: “Was your room clean?” rather than, over-broadly, “How was your stay?” Narrow focus will boost your confidence that you’ve correctly assigned a response topic, a concern if you get terse, reference-less responses such as “dirty.”
Correlate with associated data. For example, responses to multiple-choice survey questions can help you in analysis of verbatims (free-text responses). For instance, a respondent’s rating of customer service on a scale of 1-poor to 5-excellent can indicate the “tonality” of a free-text response that follows: is it likely to contain complaints or praise?
If you’re looking for customer opinions on the Web, don’t look just at blogs – there may be millions of them out there now, but how many are your customers, writing about their experiences with your products? Look also at support forums, review sites and discussion lists (if you can get to them). Find ways to enrich text-analytics results by correlating data to other operational and transactional information sources.
Once you’ve determined what results you need to produce to support your business goals, and once you’ve secured access to the right sources and designed appropriate analytical processes, do validate those processes and verify (the correctness of) results using test inputs. And do design the processes to include accuracy assessments. When you deploy them, assess performance using appropriate accuracy measures.
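One way to build such an assessment into a process is to run it against a small set of test inputs whose correct answers are already known, then compare. The sketch below assumes a message-routing classifier. The labels, the test data and the function name evaluate_category are all invented for illustration, and the "predicted" list stands in for whatever your own process outputs.

```python
def evaluate_category(predicted, expected, category):
    """Precision and recall for one category over a labeled test set."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    pairs = list(zip(predicted, expected))
    true_pos = sum(1 for p, e in pairs if p == category and e == category)
    false_pos = sum(1 for p, e in pairs if p == category and e != category)
    false_neg = sum(1 for p, e in pairs if p != category and e == category)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Hypothetical test set: labels assigned by a human reviewer...
expected = ["complaint", "complaint", "praise", "complaint", "question"]
# ...and labels produced by the process under test for the same messages.
predicted = ["complaint", "praise", "praise", "complaint", "complaint"]

precision, recall = evaluate_category(predicted, expected, "complaint")
print(f"complaint routing: precision={precision:.2f} recall={recall:.2f}")
```

Rerunning the same check after deployment, on freshly labeled samples, shows whether accuracy is holding up as the source material changes.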
In the end, keep in mind that accuracy is not an absolute, and 100% is rarely a real or realistic requirement. A variety of common-sense, best-practices techniques can help you ensure that your text-analytics solution produces the accuracy needed to meet business goals at a cost that is reasonable and appropriate.
Recent articles by Seth Grimes