For Text Data Quality, Focus on Sources
by Seth Grimes
Originally published December 9, 2009
I asserted in an earlier article that orthodox data-quality standards and approaches don’t apply to text. Text is noisy: full of errors, irregularities, and ambiguity. Sometimes you don’t even know which sources (i.e., datasets) and features (read: variables) matter, much less how content is best handled (standardized, cleansed, etc.). Meaning is contextual rather than absolute, so source characteristics may affect text data quality and analytical accuracy. There can even be usable information in seeming noise – misspellings and usage errors can be exploited for error detection and correction!
Text Source Best Practices

Start with basics. Just about every text-analytics effort is front-ended by an information retrieval (IR) step. The digital universe is huge. You need to harvest a working set of documents (of whatever form) for analysis. This may seem obvious, but you need to:
You might find it advantageous to work with content aggregators, particularly for news and possibly for social-media sources. In addition to harvesting and packaging content, they provide metadata and selection tools that can help cut data volumes down to size. However you assemble content for analysis, focus is important – a high signal-to-noise ratio. (In this instance, by “noise” I mean material that is not topical or otherwise not useful.) Call this concept “information density.”
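To make the notion concrete, here is a minimal sketch – the tokenizer, the topic-term list, and the threshold are all illustrative assumptions, not anything from the article: score each document by the fraction of its tokens that are topical, and keep only documents that are sufficiently dense.

```python
def information_density(text: str, topic_terms: set[str]) -> float:
    """Fraction of tokens matching a topical vocabulary: a crude
    proxy for the 'signal' share of a document."""
    tokens = [t.strip(".,!?;:\"'").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    return sum(t in topic_terms for t in tokens) / len(tokens)

def keep_dense(docs, topic_terms, threshold=0.1):
    """Retain only documents whose topical-token share clears a threshold."""
    return [d for d in docs if information_density(d, topic_terms) >= threshold]

terms = {"camera", "lens", "battery"}
docs = ["The camera has a great lens but weak battery life.",
        "Totally unrelated chatter about the weather today."]
# keep_dense(docs, terms) retains only the first document.
```

In practice you would use a real tokenizer and a learned topic model rather than a hand-built term list, but the filtering idea is the same.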
Continuing with definitions:
That natural-language text is chaotic complicates processing. The large volume of attitudinal information out on the Web, and the especially complex nature of subjective material, further complicate the job. I’ll clarify what I mean by “make proper use of source materials” with rules for, as an example, doing sentiment analysis right. You should aim to:
A few last best-practice points, frequently though not universally applicable, in the quest for text data quality:
Real-World Wisdom

I’ve offered a stream of insights harvested from conversations with text analytics practitioners and vendors, case studies, and my own work. At this point, I’ll present material from a couple of folks who are in the trenches dealing with quality issues.
SAS Institute has decades of experience helping customers deal with conventional data quality issues in constructing and applying statistical models for a broad range of data analysis needs. Many SAS customers have extended their analyses to text. Anne Milley, SAS’s director of technology product marketing, offers as one example of data quality issues related to text “a challenge in fraud detection with fraudsters purposely varying name and address to avoid being identified. With so much name and address standardization happening at the source, there is the potential for IT to over-clean the data, making it harder for analysts to really see what's happening further downstream.”
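Catching the deliberately varied names Milley describes is typically a job for approximate string matching. A minimal sketch follows – the names and the 0.8 threshold are illustrative, and real fraud-detection systems use far richer matching than a single character-level ratio:

```python
from difflib import SequenceMatcher

def similar_names(a: str, b: str, threshold: float = 0.8) -> bool:
    """Flag two name strings as likely variants of each other when
    their character-level similarity ratio clears a threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# A fraudster's light variation still matches; an unrelated name does not.
similar_names("John Smith", "Jon Smyth")   # True
similar_names("John Smith", "Mary Jones")  # False
```

Note the tension with Milley’s point: this kind of matching works best on data that has not been aggressively standardized upstream, since over-cleaning erases exactly the variation an analyst wants to see.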
It would not be hard to come up with many other instances where material from relevant sources may be incorrect, whether due to fraud, manipulation, or the passage of time. Preparation of source materials may introduce further errors. SAS’s Anne Milley cites an example provided by her colleague Manya Mayes – speech recognition that rendered “thank you very much” as “think ye retouch.” The errors here came from voice-to-text transcription; the situation is similar with optical character recognition (OCR) and automated translation between language pairs. In the latter case, the rule of thumb for estimation purposes is that roughly 10% error is introduced when translating between grammatically similar languages such as Spanish and English.
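Notice that the transcription errors are phonetically faithful: “think” sounds like “thank,” “ye” like “you.” That regularity is itself exploitable signal for error detection. A minimal sketch using a simplified Soundex phonetic code – simplified in that it ignores full Soundex’s h/w rules, and an illustration only, not anything SAS uses:

```python
def soundex(word: str) -> str:
    """Simplified Soundex: phonetically similar words get the same 4-char code."""
    codes = {}
    for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in letters:
            codes[ch] = digit
    word = word.lower()
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        digit = codes.get(ch, "")    # vowels, h, w map to no digit
        if digit and digit != prev:  # collapse runs of the same code
            result += digit
        prev = digit
    return (result + "000")[:4]      # pad/truncate to four characters

# The ASR garbling preserves phonetics: garbled and intended words share codes.
soundex("think") == soundex("thank")  # True: both T520
soundex("ye") == soundex("you")       # True: both Y000
```

A matching phonetic code between a transcribed word and a more plausible in-context word is one cue a correction pass can use.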
Lastly, text-analytics vendor Netbase learned a difficult lesson about quality – among others, that perceptions arising from relatively superficial issues can have a disproportionate impact – when the company launched healthBASE, a healthcare search engine powered by its technology. (Daniel Tunkelang, Endeca’s former chief scientist, who recently joined Google, described and analyzed the incident in an article posted to his blog, The Noisy Channel.) Vice President of Marketing and Product Strategy Jens Tellefsen very graciously agreed to explain the issues and their causes. I’ll relate what he had to say, verbatim:
We launched healthBASE to publicly demonstrate a new technology that semantically parses and indexes content to fully understand its meaning to deliver more relevant search. healthBASE is built on our Content Intelligence Platform, which has been deployed successfully in different domains by Fortune 1000 companies, global publishers, and the federal government over the last few years for a variety of strategic applications.
Our first release of healthBASE surfaced a few embarrassing and offensive bugs. These were far in the minority of results but enough to keep us working hard… improving the site. We deeply regret and sincerely apologize for any offense caused.
The good news is that the fix was both quick and simple. The site had not been configured to specify some input terms as nouns – “AIDS” rather than the verb “aids” – when calling our linguistics engine. The ability to use such distinctions is a fundamental capability of Content Intelligence, which is why the fix could be implemented and rolled into the site shortly after its discovery.
The other launch issue was the inclusion of Wikipedia. After some debate, we decided to include Wikipedia since it contains some good (albeit not validated) health information, recognizing that its very broad topic coverage would pull in some false information and some bizarre or irrelevant associations. It is now clear that this was confusing to users expecting authoritative results. We have since removed Wikipedia and added a notice to healthBASE to clarify that some less credible Web sources are included, that users should not expect medically validated facts from this site.
Finally, we just didn’t consider that users would be interested in running non-health-related searches on the site. We’ve started to filter input to mitigate the more offensive searches. In the mean time, we appreciate that folks might have a good laugh at the essentially random and often funny results that emerge from non-health queries…
We’ve learned a lot since the release and we’re excited to offer a showcase to demonstrate the power of this new technology. You will see improvements in the coming weeks. We appreciate the feedback. Please keep telling us what you think.
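The “AIDS”-versus-“aids” fix Tellefsen describes is a part-of-speech constraint in NetBase’s engine, but even a crude case heuristic illustrates the idea: treat an all-caps query term as an acronym or proper noun and require an exact-case match. This is a toy illustration of the distinction, not the company’s actual mechanism:

```python
def term_matches(query_term: str, token: str) -> bool:
    """Match a query term against a document token. All-caps query
    terms are treated as acronyms (nouns like 'AIDS') and must match
    case exactly, so the verb 'aids' is not conflated with them."""
    if query_term.isupper() and len(query_term) > 1:
        return token == query_term
    return token.lower() == query_term.lower()

term_matches("AIDS", "AIDS")  # True
term_matches("AIDS", "aids")  # False: the verb, not the disease
term_matches("flu", "Flu")    # True: ordinary terms match case-insensitively
```

A real linguistics engine would instead tag the query term with its part of speech and match against the token’s tag, which also handles sentence-initial capitalization that this heuristic would miss.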
Perfection is Elusive

Jens Tellefsen’s frank remarks reinforce an important point: while text data quality is a worthy goal, perfection is elusive. To maximize quality, it is essential to understand the challenges posed by text in its diverse forms, related both to source characteristics and to the use of text-sourced information. The variety and complexity of natural language make for information-rich sources while making materials difficult for software to decode. Jens’s observations, and Anne Milley’s, relay a confidence that I share: text analytics is up to the job. Text can be tamed. Close attention to the selection, preparation, and processing of source materials, guided by best practices derived from theory and experience, is an important part of text data quality work.
Copyright 2004 — 2019. Powell Media, LLC. All rights reserved.
BeyeNETWORK™ is a trademark of Powell Media, LLC