Originally published June 7, 2011
Research Magazine, a British publication, has published a text-analytics overview, Write here, write now, with comments from a number of industry and analyst sources, myself included. Author Paul Golden did a nice job with the story, yet there's more to be said – about text analytics and sentiment analysis, accuracy and best practices – than could be captured in a single, multi-sourced article. To help readers dig a bit deeper, I'll share the full text of my responses to Paul's interview questions.
How exactly do text analytics tools measure sentiment and how do they deal with variables such as irony or sarcasm?
It's complicated, a multi-step process. First, you need to define and detect the things you want to measure. You're interested in sentiment, but about what? A product, service, or brand? Your own or also competitors'? Just at a high level, or also in detail – for instance, not just hotel chain X but, for a particular property, room cleanliness, staff friendliness, food, amenities, and price? Or, in the other direction, general market conditions that might affect your company? You detect via a monitoring solution that recognizes names, terms, and concepts (call them, collectively, features) and then uses natural language processing (NLP) to associate sentiment and other attributes with those features. And don't look only for "polarity" – positive/negative/neutral sentiment – if the business problem at hand calls for a different sentiment alignment: if you're working in customer service, for instance, emotional tone such as angry, happy, or sad.
Once you've detected, you can create aggregate measures, plot and compare trends, and so on.
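The detect-then-associate step can be sketched in a few lines of Python. The feature and sentiment lexicons below are invented for illustration – a real monitoring solution would use full NLP rather than keyword lists and sentence-level proximity – but the shape of the computation is the same: recognize features, then attach nearby sentiment to them.

```python
import re

# Illustrative lexicons -- stand-ins for a real solution's NLP resources.
FEATURE_TERMS = {"room": "room cleanliness", "staff": "staff friendliness",
                 "food": "food", "price": "price"}
SENTIMENT_TERMS = {"clean": 1, "friendly": 1, "great": 1,
                   "dirty": -1, "rude": -1, "overpriced": -1}

def feature_sentiment(text):
    """Attach sentiment scores to the features mentioned in each sentence."""
    results = {}
    for sentence in re.split(r"[.!?]", text.lower()):
        words = re.findall(r"[a-z]+", sentence)
        features = [FEATURE_TERMS[w] for w in words if w in FEATURE_TERMS]
        score = sum(SENTIMENT_TERMS.get(w, 0) for w in words)
        for feature in features:
            results[feature] = results.get(feature, 0) + score
    return results

scores = feature_sentiment("The room was dirty. The staff were friendly and great.")
```

Aggregating such per-feature scores across thousands of reviews is what makes the trend plots and comparisons possible.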
Tackle these not-so-basic basics first before you take a shot at complexities such as irony and sarcasm, which are very difficult to decode systematically. Very often, even humans don't get them. If you do want to automate, you'll almost certainly need linguistic techniques that match word use and patterns against vocabulary and phrases that indicate irony and sarcasm. Those techniques are pretty much still in the research realm.
Are these tools capable of supporting all the languages and dialects spoken online?
Any language or dialect can be analyzed via statistically rooted techniques, given a sufficiently large corpus (document set). In practice, however, you need to add in linguistic techniques – dictionaries, part-of-speech resolution, syntax rules, lexical patterns – and machine learning, with adaptation to domain vocabularies, to achieve the highest level of accuracy and usefulness. No one would expend the time, effort, and funds to create solutions without a strong business case, and that means that solutions are much better developed for American English and a few other languages than for, say, Laotian or Xhosa.
Don't expect that you can get good results by simply translating into English, analyzing there, and translating back into the source language. Sentiment expressions don't translate well.
Do as much as you can. For a less-used language, start by creating a lexicon of sentiment-bearing words – "like," "love," "hate," "bad" – in the target language, and use it to detect sentiment for further analysis by a person. A part-way automated solution of this nature will surely be better than human-only analysis, if only in its reach.
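A part-way automated triage of this sort might look like the following sketch: a small seed lexicon of sentiment-bearing words (translated into the target language – the English words here are placeholders) routes posts that probably carry opinion to a human analyst.

```python
# Stand-ins for sentiment-bearing words translated into the target language.
SEED_LEXICON = {"like", "love", "hate", "bad"}

def needs_human_review(post):
    """True if the post contains any seed sentiment word."""
    return bool(set(post.lower().split()) & SEED_LEXICON)

posts = ["I love this phone",
         "Store hours change next week",
         "bad service again"]
flagged = [p for p in posts if needs_human_review(p)]
```

Even a crude filter like this extends human analysts' reach: they review only the posts likely to carry sentiment instead of the whole stream.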
How accurate is this measurement and to what extent are false positives/false negatives an issue?
A focus on accuracy, in the sense of "you have to reach X% or the solution is useless," can be a huge red herring. Accuracy should be enough to help you solve the business problem at hand. For a counter-terrorism application, you'd aim for 100% "recall" with a high tolerance for low precision and false positives (which you'd screen out via human review), while for another application, you may be willing to miss many cases in order to gain very high "precision."
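The recall-versus-precision tradeoff can be made concrete with a toy calculation; the counts below are invented for illustration.

```python
def precision_recall(tp, fp, fn):
    """Precision: of what you flagged, how much was right.
    Recall: of what was out there, how much you caught."""
    return tp / (tp + fp), tp / (tp + fn)

# Counter-terrorism-style tuning: catch everything, and accept the
# false positives that human reviewers will screen out.
p_wide, r_wide = precision_recall(tp=50, fp=450, fn=0)

# Precision-first tuning: willing to miss many cases so that what
# you do report can be trusted.
p_tight, r_tight = precision_recall(tp=50, fp=2, fn=48)
```

Same detector, different thresholds: the first configuration achieves perfect recall at 10% precision, the second pushes precision above 95% while missing nearly half the true cases.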
This said, I'd venture that general-purpose automated tools, untrained and untuned, typically hit only 50% or so accuracy. But tools designed for particular business needs and domains will do much better out of the box, with accuracy that starts above 80%.
By the way, the best academic study I've seen showed only 82% agreement between two human sentiment annotators, rising to 90% if uncertain cases are removed. Humans aren't perfect!
How do text analytics tools deal with the slang/vernacular/abbreviations that are commonplace online?
Specialization and training. Text analytics tools deploy domain-specific vocabularies, thesauruses, and language-rule sets to decode online chatter, and they apply machine-learning techniques to fill in gaps. There's generally a lot of room for improvement, however.
Can text analytics tools distinguish between genuine online communication and automated communications such as spam?
Spam detection is an active research problem with several commercial implementations. The basic approach is to look for anomalous language patterns, such as repeated use of unusual phrases, and anomalous behaviors, such as accounts – with IP (Internet Protocol) addresses resolving to a particular geographic area – that post only a certain type of review. It's all about pattern detection.
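One such behavioral pattern can be sketched as a toy heuristic; the threshold and record fields here are invented for illustration, and a real system would combine many signals rather than rely on one.

```python
def looks_like_spam(account_reviews, min_reviews=5):
    """Flag accounts with many reviews that all carry the same rating --
    one crude behavioral anomaly in the spirit of spam detection."""
    if len(account_reviews) < min_reviews:
        return False  # too little history to judge
    return len({r["rating"] for r in account_reviews}) == 1

shill = [{"product": f"p{i}", "rating": 5} for i in range(8)]
normal = [{"product": "p1", "rating": 5}, {"product": "p2", "rating": 2},
          {"product": "p3", "rating": 4}, {"product": "p4", "rating": 5},
          {"product": "p5", "rating": 3}]
```

Genuine reviewers vary; an account that awards eight five-star ratings and nothing else stands out as a pattern worth escalating.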
Do text analytics tools tell you anything about the person responsible for the comment?
Text analytics can tell you about a poster; this is, again, an active research area. The first thing to do, however, is look at the "metadata" associated with a comment: the handle, ID, or email address of the poster and the information in an associated profile, which may include name, age, sex, location, and a URL – often everything you need to know.
When explicit information isn't present, or when you want to corroborate or supplement that information, certain tools can infer a poster's identity profile and psychological profile via language analysis. Our choice of words, expressions, abbreviations, and even cultural reference says a lot about who we are and where we're from.
Is there a danger that because of re-tweets and forwarded emails, these tools end up analyzing the same text numerous times?
Data quality is a challenge in many business situations. Data profiling, deduplication, and cleansing are needed for text, just as for "structured" data records collected by enterprise operational systems.
Email, forum postings, and other threaded conversations often quote other messages, but with sufficient clues (such as a start-of-line ">" character) to allow quoted material to be deleted. On Twitter, a retweet is indicated by "RT" or by a Twitter-generated marker. Patterns such as these allow you to automate deduplication and cleansing.
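Those clues make the deduplication automatable, as in this sketch (the normalization rules are simplified; real retweet markers and quoting conventions vary):

```python
import re

def is_duplicate(message, seen):
    """Strip quoted lines and retweet prefixes, then dedupe on cleaned text."""
    # Drop quoted material in email/forum threads (lines starting with ">").
    body = "\n".join(line for line in message.splitlines()
                     if not line.lstrip().startswith(">"))
    # Drop a manual Twitter retweet prefix of the form "RT @user: ...".
    body = re.sub(r"^RT @\w+:\s*", "", body.strip())
    key = re.sub(r"\s+", " ", body).lower()
    if key in seen:
        return True
    seen.add(key)
    return False

seen = set()
messages = ["Great keynote today!",
            "RT @alice: Great keynote today!",
            "> Great keynote today!\nAgreed, best talk of the day."]
unique = [m for m in messages if not is_duplicate(m, seen)]
```

The retweet collapses into the original, while the reply survives once its quoted material is stripped away.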
By the way, if you're studying the diffusion of messages and information across online and social platforms, these clues will prove invaluable.
Is there still a role for insight professionals in making sense of data generated from automated analysis?
Every technologist with any practical experience will tell you that human-machine hybrids are the way to go. Use machines for their speed, reach, scalability, consistency, and other beyond-human capabilities. Use humans to train, guide, and oversee automated systems and interpret findings.
Modern automobiles were invented something like 125 years ago, and they still don't drive themselves. Automated text analytics was invented in the late 1950s but still, in most applications, also has a ways to go before it can operate accurately without human oversight.
Recent articles by Seth Grimes