For Text Data Quality, Focus on Sources

Originally published December 9, 2009

I asserted in an earlier article that orthodox data-quality standards and approaches don’t apply to text. Text is noisy, full of errors, irregularities, and ambiguity. Sometimes you don’t even know what sources (i.e., datasets) and features (read: variables) matter, much less how content should best be handled (i.e., standardized, cleansed, etc.). Meaning is contextual rather than absolute, so source characteristics may affect text data quality and analytical accuracy. There can even be usable information in seeming noise – in misspellings and usage errors – that can be exploited for error detection and correction!

Text sources are many and varied: websites, email and contact-center notes, traditional (edited) media as well as social media and forums, scientific and technical literature, legal documents, warranty and insurance claims, reference materials and reports, and so on. It’s almost self-evident that choice and treatment of sources are critical to text data quality, yet, as of now, there are few standard practices for choosing and working with sources. In contrast with the enterprise data world, where data for analysis has been integrated, cleansed, and harmonized in a data warehouse or via enterprise information integration techniques, I’d call text largely ungovernable and say that there’s little by way of “master data” to manage.

There are, nonetheless, emerging best practices for selection and processing of text sources that can lead to better text data quality. I’ll cover them, informally, and then in the next section, I’ll relay real-world wisdom from a couple of industry authorities.

Text Source Best Practices

Start with basics. Just about every text-analytics effort is front-ended by an information retrieval (IR) step. The digital universe is huge. You need to harvest a working set of documents (of whatever form) for analysis. This may seem obvious, but you need to:
  • Choose the right sources.
The selection process is hugely important, as we’ll see in the next section when we look at a system called healthBASE. A corollary principle is:
  • That a source is available doesn’t mean it’s right for the job.
How do you choose, and how do you exclude? Source selection criteria include topicality, focus, currency, and authority, as well as your processing capabilities and analytics needs. What do I mean by these criteria? Some definitions:
  • Topicality is simply a judgment that the source contains the information you need… and enough of that information to justify mining it.
If you’re looking for consumer sentiment, for instance, don’t search the open Web. Identify forums and blogs with the right focus: for travel and restaurants, say, TripAdvisor, FlyerTalk, Yelp, and Zagat. Focus your source selection. If you’re doing competitive intelligence for burger joints in the mid-Atlantic region, you probably don’t need to look at attitudes toward In-N-Out or Dick’s Drive-In.

You might find it advantageous to work with content aggregators, particularly for news and possibly for social-media sources. In addition to harvesting and packaging content, they provide metadata and selection tools that can aid in cutting data volumes down to size. However you assemble content for analysis, focus is important – a high “signal-to-noise” ratio. (In this instance, by “noise” I mean material that is not topical or otherwise not useful.) Call this concept “information density.”
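To make “information density” concrete, here’s a minimal sketch in Python that scores a document by the share of its tokens matching a topical vocabulary. The vocabulary here is a hypothetical hand-picked set; in practice you would derive it from a taxonomy, domain experts, or a trained model:

import re

# Hypothetical topical vocabulary for a burger-chain study.
TOPIC_TERMS = {"burger", "fries", "milkshake", "menu", "service", "restaurant"}

def information_density(text: str) -> float:
    """Fraction of tokens that are topical -- a crude signal-to-noise proxy."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in TOPIC_TERMS) / len(tokens)

docs = [
    "The burger was great but the fries were cold and the service slow.",
    "Went hiking, then errands; stopped somewhere for lunch, I forget where.",
]
for doc in docs:
    print(f"{information_density(doc):.2f}  {doc}")

Thresholding such a score is one crude way to prefer dense sources over diffuse ones.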

Continuing with definitions:
  • Currency is the timeliness of information, a measure of whether, for a particular task, the information is out of date.

    Responsive brand and reputation management, for instance, requires immediate access to Twitter streams, blogs and other social media, and forum postings that are as close to real-time as possible. Items that are even a few hours old may be little more than noise. (A minimal currency-filter sketch follows these definitions.)

  • Authority is the trustworthiness of an information source.

    Formal publications are (typically) authoritative, as are primary materials such as blogs, survey responses, and email messages. Yet email and blog spam, websites that simply aggregate links, non-original reprints of syndicated content, and even text quoted from previous messages in an email or forum thread may be considered non-authoritative. Don’t rush to filter them out, however. Information-flow patterns may be useful for studies of opinion diffusion in social media and for influence analysis, even if the non-original repetitions aren’t germane to getting at authoritative, text-sourced information.

    Do note that authority is not universal, that it may be time-linked, and that it is not the same as correctness. For example, George W. Bush is an authority on life in the White House during his administration but (as we have seen) not on economics, and despite his presidential authority, his assertions about Iraqi acquisition of uranium were not correct.
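To illustrate the currency criterion, here’s a minimal sketch, in Python, of an age-based filter for a near-real-time monitoring task. The item structure and the six-hour cutoff are assumptions for the example, not a standard:

from datetime import datetime, timedelta, timezone

# Hypothetical items with publication timestamps; a real feed would supply these.
items = [
    {"text": "Outage reported at checkout",
     "published": datetime(2009, 12, 9, 14, 30, tzinfo=timezone.utc)},
    {"text": "Last month's promo recap",
     "published": datetime(2009, 11, 2, 9, 0, tzinfo=timezone.utc)},
]

def current_items(items, now, max_age=timedelta(hours=6)):
    """Keep only items fresh enough for the task at hand."""
    return [item for item in items if now - item["published"] <= max_age]

now = datetime(2009, 12, 9, 18, 0, tzinfo=timezone.utc)
for item in current_items(items, now):
    print(item["text"])  # only the 3.5-hour-old item survives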
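And to illustrate the authority point – keeping repeated material available for diffusion and influence analysis while excluding it from authoritative extraction – here’s a sketch that assumes the common “>” quoting convention; real email and forum threads are messier than this:

import re

def split_original_and_quoted(message: str):
    """Separate a poster's own words from quoted material. Lines starting
    with '>' are a common email/forum convention; real data varies."""
    original, quoted = [], []
    for line in message.splitlines():
        (quoted if re.match(r"\s*>", line) else original).append(line)
    return "\n".join(original).strip(), "\n".join(quoted).strip()

post = """> The new terminal is a disaster.
I disagree; security was quick and the gates are well marked."""
own_words, repeated = split_original_and_quoted(post)
print("Authoritative:", own_words)
print("Repeated:", repeated)  # retain for opinion-diffusion study; don't discard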
Whatever your sources, you’d do well to:
  • Generate and retain provenance metadata, attached to documents and to extracted information, that describes sources.
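A minimal sketch of what such provenance metadata might look like in code follows; the field names are illustrative, not any standard:

from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Provenance:
    source_name: str      # e.g., "TripAdvisor forum"
    source_url: str
    retrieved_at: datetime
    source_type: str      # "forum", "news", "email", ...
    notes: str = ""

@dataclass
class Document:
    text: str
    provenance: Provenance
    extractions: list = field(default_factory=list)  # extracted items inherit
                                                     # the document's provenance

doc = Document(
    text="The fries were cold.",
    provenance=Provenance(
        source_name="example forum",
        source_url="http://forum.example.com/post/123",
        retrieved_at=datetime(2009, 12, 9, 15, 0),
        source_type="forum",
    ),
)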
Regarding processing capabilities: The best of source materials are valueless if you can’t make proper use of them. It’s better to scale back goals than to undertake a project you can’t do well. This principle may mean sampling from sources rather than acquiring and loading a complete document set. It may mean analyzing abstracts rather than full articles. It may mean forgoing sources in a language your software can’t handle.

That natural-language text is chaotic complicates processing. The large volume of attitudinal information on the Web and the especially complex nature of subjective material further complicate the job. I’ll clarify what I mean by “make proper use of source materials” with rules for an example task, sentiment analysis done right. You should aim to:
  • Distinguish extracted fact from opinion.

  • Distinguish the opinion holder from the sentiment object.
For example, the sentence, “Treasury Secretary Timothy F. Geithner acknowledges that the federal budget deficit, at $176.36 billion for October 2009, is too high,” mixes fact and opinion. In this example, the sentiment “too high” obviously applies to the federal budget deficit and not to the treasury secretary, the opinion holder, himself.
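As a toy illustration of both rules, here’s a sketch that pulls the opinion holder, sentiment object, fact, and opinion out of that very sentence. The patterns are keyed to this one sentence’s structure; production systems rely on syntactic parsing and trained opinion-extraction models, not hand-written regular expressions:

import re

SENTENCE = ("Treasury Secretary Timothy F. Geithner acknowledges that the federal "
            "budget deficit, at $176.36 billion for October 2009, is too high.")

# Toy patterns for "<holder> acknowledges that <clause>" and for a clause of the
# form "<target>, at <measurement>, is <judgment>".
REPORT = re.compile(r"^(?P<holder>.+?)\s+acknowledges that\s+(?P<clause>.+?)\.?$")
CLAUSE = re.compile(r"^(?P<target>.+?), at (?P<fact>.+?), is (?P<opinion>.+)$")

report = REPORT.match(SENTENCE)
clause = CLAUSE.match(report.group("clause"))
print("Opinion holder:  ", report.group("holder"))   # ... Timothy F. Geithner
print("Sentiment object:", clause.group("target"))   # the federal budget deficit
print("Extracted fact:  ", clause.group("fact"))     # $176.36 billion for October 2009
print("Opinion:         ", clause.group("opinion"))  # too high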

A few last best-practice points, frequently though not universally applicable in the quest for text data quality:
  • Clean your sources; for instance, separate Web page content for analysis from ads and navigation elements (see the sketch after this list).

  • Retain access to (indexed and/or annotated) full-text sources, which will help with analytical functions such as root-cause analysis and will allow reprocessing should analytical needs change.
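On the cleaning point, here’s a minimal sketch using only Python’s standard library that separates page text from navigation and other boilerplate by tag. The tag list is illustrative; real pages generally need site-specific rules or a trained content extractor:

from html.parser import HTMLParser

# Tags whose contents we treat as boilerplate -- illustrative, not exhaustive.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}

class ContentExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a boilerplate element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

page = ("<html><nav>Home | About</nav>"
        "<p>The burger was excellent.</p><footer>Ad links</footer></html>")
parser = ContentExtractor()
parser.feed(page)
print(" ".join(parser.chunks))  # -> The burger was excellent.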

Real-World Wisdom

I’ve offered a stream of insights harvested from conversations with text analytics practitioners and vendors, case studies, and my own work. At this point, I’ll present material from a couple of folks who are in the trenches dealing with quality issues.

SAS Institute has decades of experience helping customers deal with conventional data quality issues in constructing and applying statistical models for a broad range of data analysis needs. Many SAS customers have extended their analyses to text. Anne Milley, SAS’s director of technology product marketing, offers as one example of data quality issues related to text “a challenge in fraud detection with fraudsters purposely varying name and address to avoid being identified. With so much name and address standardization happening at the source, there is the potential for IT to over-clean the data, making it harder for analysts to really see what's happening further downstream.”

It would not be hard to come up with many other instances where material from relevant sources may be incorrect, whether due to fraud, manipulation, or the passage of time. Preparation of source materials may introduce other errors. SAS’s Anne Milley cites an example provided by her colleague Manya Mayes: speech recognition that rendered “thank you very much” as “think ye retouch.” That garbling stemmed from voice-to-text transcription errors; the situation is similar with optical character recognition (OCR) and automated translation between language pairs. In the latter case, the rule of thumb for estimation purposes is 10% error introduced when translating between grammatically similar languages such as Spanish and English.

Lastly, text-analytics vendor Netbase learned a very difficult lesson about quality – including that perceptions arising from relatively superficial issues can have a disproportionate impact – when the company launched healthBASE, a healthcare search engine powered by the company's technology. (Daniel Tunkelang, Endeca's former chief scientist, who recently joined Google, described and analyzed the incident in an article posted to his blog, The Noisy Channel.) Vice President of Marketing and Product Strategy Jens Tellefsen very graciously agreed to explain the issues and their causes. I’ll relate what he had to say, verbatim:

We launched healthBASE to publicly demonstrate a new technology that semantically parses and indexes content to fully understand its meaning to deliver more relevant search. healthBASE is built on our Content Intelligence Platform, which has been deployed successfully in different domains by Fortune 1000 companies, global publishers, and the federal government over the last few years for a variety of strategic applications.

Our first release of healthBASE surfaced a few embarrassing and offensive bugs. These were far in the minority of results but enough to keep us working hard… improving the site. We deeply regret and sincerely apologize for any offense caused.

The good news is that the fix was both quick and simple. The site had not been configured to specify some input terms as nouns – “AIDS” rather than the verb “aids” – when calling our linguistics engine. The ability to use such distinctions is a fundamental capability of Content Intelligence, which is why the fix could be implemented and rolled into the site shortly after its discovery.

The other launch issue was the inclusion of Wikipedia. After some debate, we decided to include Wikipedia since it contains some good (albeit not validated) health information, recognizing that its very broad topic coverage would pull in some false information and some bizarre or irrelevant associations. It is now clear that this was confusing to users expecting authoritative results. We have since removed Wikipedia and added a notice to healthBASE to clarify that some less credible Web sources are included and that users should not expect medically validated facts from this site.

Finally, we just didn’t consider that users would be interested in running non-health-related searches on the site. We’ve started to filter input to mitigate the more offensive searches. In the meantime, we appreciate that folks might have a good laugh at the essentially random and often funny results that emerge from non-health queries…

We’ve learned a lot since the release and we’re excited to offer a showcase to demonstrate the power of this new technology. You will see improvements in the coming weeks. We appreciate the feedback. Please keep telling us what you think.

Perfection is Elusive

Jens Tellefsen’s frank remarks reinforce a very important point: while text data quality is an important goal, perfection is elusive. To maximize quality, it is essential to understand the challenges posed by text in its diverse forms, challenges related both to source characteristics and to the use of text-sourced information. The variety and complexity of natural language make for information-rich sources while making materials difficult for software to decode. Jens’s observations, and Anne Milley’s, convey a confidence, which I share, that text analytics is up to the job. Text can be tamed. Close attention to the selection, preparation, and processing of source materials, guided by best practices derived from theory and experience, is an important part of text data quality work.

  • Seth Grimes

    Seth is a business intelligence and decision systems expert. He is founding chair of the Text Analytics Summit, chair of the Sentiment Analysis Symposium, and principal consultant at Washington, D.C.-based Alta Plana Corporation. Seth consults, writes, and speaks on information-systems strategy, data management and analysis systems, IT industry trends, and emerging analytical technologies.

Comments

Posted December 10, 2009 by Justin Langseth

Seth, very interesting article.  

At Clarabridge we have encountered similar challenges as we have added Social Media-sourced content into Clarabridge text analytics systems for our customers.  

In the Social Media sphere there are multiple related issues. Advertisements pollute otherwise useful results, but can be largely removed via linguistic detection. Spam also creeps through the typically used statistical spam filters, but can likewise be removed by looking for linguistic fingerprints that are common to spam, such as a very low ratio of grammatical connections per sentence and, often, a high average per-sentence word count.
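To make one of those fingerprints concrete, here is a minimal sketch of the average per-sentence word count feature (the grammatical-connections ratio would require a parser, so it is omitted); the example texts are invented:

import re

def avg_sentence_length(text: str) -> float:
    """Average words per sentence; unusually high values are one crude spam cue."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)

ham = "The burger was fine. Service was slow. I'd go back."
spam = ("buy cheap meds online best prices guaranteed "
        "shipping worldwide no prescription needed click now")
print(avg_sentence_length(ham))   # ~3.3
print(avg_sentence_length(spam))  # 14.0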

Then there is the issue of off-topic comments.  If someone wants to analyze a mid-Atlantic burger chain, even postings that actually say something useful about that chain may also talk about other aspects of the person's day-to-day activities.  The burger-related post may be 3 sentences out of a 30-sentence blog that talks about various other companies and activities as well.  This is another area where text analytics can greatly help: to separate the on-topic "wheat" (from the specific analyst's perspective, that is) from the unrelated "chaff."

So thank you for raising these important concerns.  These are areas where collaboration and best-practice sharing could benefit us all.

Justin Langseth, President & CTO, Clarabridge, Inc. 
