Text Data Quality
by Seth Grimes
Originally published November 17, 2009
Data quality is a good thing, yet orthodox quality thinking doesn’t apply to text. It is (currently) impossible, with text, to achieve anything near the 100% definitional precision demanded by data quality purists. The problem is not only that quality steps designed for data in and from transactional and operational systems don’t extend to text sources. (I’m referring to data profiling, cleansing, and standardization with a central role for master data management and data governance.) Documents are different from databases to the point where conventional data quality steps may even be undesirable in work with text. Unlike data neatly stored in database fields, text-sourced data is ambiguous. Meaning is contextually dependent and often, further, is best construed in light of user intent. There are few absolutes. There are even fundamental questions about what constitutes useful, usable data. In the extreme case, given the chaotic but expressive “natural language” found in online forums, text messages and email, the irregularities – the seeming noise – may contain information that can fuel better, more responsive, more accurate information systems!
People Make Mistakes
The basic text data quality issue is that humans make mistakes, and the challenge is that people’s natural-language mistakes defy easy, automated detection. Knot won of us hasn’t maid a correctly spelled writing error. Okay, that is a joke, but I can’t tell you how many times I’ve written “there” when I meant “their.” But consider another, real-world example, a response to a question in my 2009 text-analytics user survey, “Please describe your overall experience – your satisfaction – with text analytics:”
“OK, it is hard to describe satisfaction of using text analytics tools when we all know how language is ambiguous and complex – we cannot expect too much from automatic processing yet, maybe in the time when neutral networks can be used, but NLP on its own cannot impress us yet I think.”
Manfred Pitz, TEMIS sales director in Germany, wrote me to point out, “Could it be that ‘neutral networks’ needs to be corrected into ‘neural networks’?” Of course he’s right. This particular problem would be very difficult for software to detect, especially because “neutral” is an adjective, so there’s nothing syntactically (grammatically) incorrect about the phrase “neutral networks can be used.” I suppose a statistical model, built from a large document corpus, could flag “neutral networks” as an outlier, a very uncommonly found term, noting also that the context discusses NLP and analytics.
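The statistical idea floated above can be sketched in a few lines. This is only an illustration under loud assumptions: the four-sentence reference corpus is made up, the function names (`bigram_counts`, `one_edit_apart`, `flag_suspect_bigrams`) and thresholds are hypothetical, and a real system would need a very large corpus plus some notion of topical context.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def bigram_counts(documents):
    """Count adjacent word pairs across a reference corpus."""
    counts = Counter()
    for doc in documents:
        tokens = tokenize(doc)
        counts.update(zip(tokens, tokens[1:]))
    return counts

def one_edit_apart(a, b):
    """True if a and b differ by exactly one insertion, deletion or substitution."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) word
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # one substituted character
    return a[i:] == b[i + 1:]          # one extra character in b

def flag_suspect_bigrams(text, counts, vocabulary, min_alt_count=2):
    """Flag word pairs never seen in the corpus when swapping the first word
    for a one-edit neighbor yields a pair the corpus has seen repeatedly."""
    tokens = tokenize(text)
    flags = []
    for first, second in zip(tokens, tokens[1:]):
        if counts[(first, second)] > 0:
            continue  # pair is attested, nothing to flag
        for candidate in sorted(vocabulary):
            if (one_edit_apart(first, candidate)
                    and counts[(candidate, second)] >= min_alt_count):
                flags.append(((first, second), (candidate, second)))
    return flags

# Made-up reference corpus, for illustration only.
reference_corpus = [
    "Neural networks are widely used in natural language processing.",
    "We trained neural networks on a large document corpus.",
    "The committee remained neutral on the proposal.",
    "Switzerland stayed neutral during the war.",
]
counts = bigram_counts(reference_corpus)
vocabulary = {word for pair in counts for word in pair}

response = "maybe in the time when neutral networks can be used"
flags = flag_suspect_bigrams(response, counts, vocabulary)
print(flags)
```

With this toy corpus, the only pair flagged is "neutral networks", with "neural networks" offered as the likelier phrase. Note that the sketch flags and suggests; it does not decide, which is about as much as the statistics justify.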
One would have to go to great lengths to automatically detect this particular problem and others like it. Beyond detection, systematically correcting all such problems in the name of sacrosanct data quality notions would likely take extraordinary resources and cost far more than correction is worth.
Lesson #1 is that text data quality issues are subtle and correcting them may be expensive beyond any real need as determined by business goals.
By the way, did you catch the usage “in the time when,” where the non-idiomatic use of the article “the” suggests that the writer is not a native English speaker? Interesting, but probably inconsequential. Simply put: not every quality issue matters, nor, for that matter, does every bit of information.
Context Counts
“Neutral” and “neural,” words relevant to the previous example, make sense in different contexts, as do polysemous terms such as NLP, which in the text-analytics context stands for natural language processing. For many people, NLP abbreviates neuro-linguistic programming. (“Polysemous” is linguistics-speak for “having multiple senses or meanings.”) Context is key to deciding which meaning is correct or best. A second quality challenge, beyond discerning and dealing with error, is correctly handling good data.
I often use “ford” as an example. “Ford” may variously be a U.S. president or the same person as a college football player, member of the House of Representatives, or vice president; the theater where a president was shot; a movie star (Harrison and Glenn); an auto manufacturer or the name of the family that founded and ran the company; a shallow place you can cross a river without a bridge or boat; or the act of crossing an unbridged river. You get the idea: text-sourced information resists neat classification. It can be hard to even determine a data value’s (reference) category, whether it’s a person, a company, a building, or a geographic feature. It is certainly hard to do these things accurately.
Text analytics can infer context from content in order to make the best automated decisions about meaning. The methods are going to be fuzzy, accommodating uncertainty, unlike the exact methods preferred in the data quality world.
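To make "fuzzy" concrete, here is a minimal sketch of context-based sense scoring for "ford". Everything in it is an assumption for illustration: the cue-word lists in `SENSE_CUES` are invented by hand, and the function `score_senses` is hypothetical. Real text-analytics tools learn such associations from data and use far richer context than word overlap.

```python
import re

# Hypothetical, hand-picked cue words for three senses of "ford".
SENSE_CUES = {
    "president": {"president", "gerald", "congress", "pardon", "white", "house"},
    "automaker": {"car", "truck", "motor", "dealer", "mustang", "factory"},
    "river crossing": {"river", "stream", "cross", "shallow", "wade", "water"},
}

def score_senses(text, sense_cues=SENSE_CUES):
    """Return a score between 0 and 1 for each sense, based on cue-word overlap.

    Scores sum to 1; with no evidence at all, every sense gets an equal share.
    """
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    overlaps = {sense: len(tokens & cues) for sense, cues in sense_cues.items()}
    total = sum(overlaps.values())
    if total == 0:
        return {sense: 1 / len(sense_cues) for sense in sense_cues}
    return {sense: n / total for sense, n in overlaps.items()}

sentence = "We had to ford the shallow river before the truck could cross."
scores = score_senses(sentence)
best = max(scores, key=scores.get)
print(best, scores)
```

The output is a set of weights rather than a single verdict: the river-crossing sense wins here, but the stray "truck" leaves some weight on the automaker sense. That residual uncertainty is the point.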
Lesson #2 is that meaning in text is hard to pin down – it's often fuzzy and indeterminate, contextually based – to the point where traditional data quality measures do not apply.
Judgment Calls
Data processing and quality work always involves judgment calls, cases that could go one way or another, where a perhaps seemingly arbitrary decision has to be made. I’d say these situations are far more common in the text world than for structured data due to the ambiguity of natural language. Take stemming decisions as an example: the normalization of word forms.
“Antiauthoritarian” and “unauthorized” have the same root, but I’d venture that if you’re mining a document set to build a bibliography, you’d want to avoid for classification purposes reducing either word to the shared root, “author.” By contrast, the relation between notary and notarize is regular, as Bob Carpenter notes in a from-the-trenches, technical discussion of the problem in his blog article, “To Stem or Not to Stem?” Carpenter reports having found nearly 100 forms of “author” in 5GB of newspaper text.
So do you, say, just reduce every word to its first five characters, or do you more elegantly and aggressively remove prefixes and suffixes to get to word stems? Perhaps you should even ignore word forms and get to word identities via an approach such as statistical clustering based on co-occurrence or co-reference. In the end, the choice of approach should correspond to the business problem at hand. If you want high precision (few false positives), you might be very conservative about stemming. If you want high recall (few missed cases), you might stem very aggressively.
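The trade-off is easy to see on the words discussed above. The sketch below compares the two options: crude five-character truncation and a naive affix stripper. The prefix and suffix lists are hypothetical and deliberately tiny, chosen only to cover these examples; this is not the Porter stemmer or any other published algorithm.

```python
from collections import defaultdict

def truncate_stem(word, length=5):
    """Crude option: keep only the first few characters."""
    return word.lower()[:length]

# Hypothetical affix lists, just large enough for the examples below.
PREFIXES = ("anti", "un")
SUFFIXES = ("itarian", "ized", "ize", "s", "y")

def affix_stem(word, min_stem=4):
    """Aggressive option: strip at most one known prefix and one known suffix,
    as long as at least min_stem characters remain."""
    stem = word.lower()
    for prefix in PREFIXES:
        if stem.startswith(prefix) and len(stem) - len(prefix) >= min_stem:
            stem = stem[len(prefix):]
            break
    for suffix in SUFFIXES:
        if stem.endswith(suffix) and len(stem) - len(suffix) >= min_stem:
            stem = stem[:-len(suffix)]
            break
    return stem

def group_by_stem(words, stemmer):
    """Group words that the given stemmer maps to the same stem."""
    groups = defaultdict(list)
    for word in words:
        groups[stemmer(word)].append(word)
    return dict(groups)

words = ["author", "authors", "unauthorized", "antiauthoritarian",
         "notary", "notarize"]

truncated = group_by_stem(words, truncate_stem)
stripped = group_by_stem(words, affix_stem)
print(truncated)
print(stripped)
```

Truncation happens to keep "unauthorized" and "antiauthoritarian" away from "author", which is what the bibliography builder wants, while still merging "notary" with "notarize". The affix stripper folds all four "author" forms into one group: better recall, worse precision. Neither result is right in the abstract; which one is acceptable depends on the task.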
Challenge #3 is understanding what you want to get out of data, text-sourced or more conventional, and designing appropriate processing steps. Lesson #3 is that we shouldn’t put the figurative text data quality cart before the analytical horse. Text processing decisions are driven by our understanding of search or analytical goals rather than by a priori quality dictates.
Learning from Mistakes
So not every text feature – not every potential text-sourced data item nor every potential text data quality issue – merits detection and use or correction. The picture is further confused: variations and irregularities may not only not be errors, they may convey information. A grammarian will tell you that the capitalization in the following text is irregular, but any human reader knows that the capitalization and repeated exclamation points intensify the poster’s message:
“We only have two words for this hotel: STAY AWAY!!”
Even seeming misspellings may be slang or dialect, so if you correct “phat” to “fat” or “pfat” to “phat,” then “Lucy, you've got some 'splainin' to do!”
It’s great when software can understand the difference between errors and information… and when there’s information in errors. Send “splainin” to Google, and you get links to sources. If you send “splaining” with a “g” on the end to Google, you’re asked,
Did you mean: splainin
Clearly Google recognizes “splainin” (with or without the apostrophes that indicate elisions) as good text, and it recognizes “splaining” as a more probable error. Google is offering one example of a query reformulation, enabled by a bit of analytical judo that exploits responses to identified issues to offer potential corrections. (Judo’s “soft method,” according to Wikipedia, “is the principle of using one's opponent's strength against him and adapting well to changing circumstances.”) In examples such as this one, I’d infer that people click the “did you mean” text often enough for the search engine to associate the erroneous spelling with the correct one, irregular as the correct one is in this particular case.
Indeed, Prof. Marti Hearst writes in her book Search User Interfaces, which was published this last summer,
“Search logs suggest that from 10-15% of queries contain spelling or typographical errors. Fittingly, one important query reformulation tool is spelling suggestions or corrections.”
Text data quality lesson #4 is that irregularities – the seeming noise – may contain information that can fuel better, more responsive, more accurate information systems.
Getting Real
Marti Hearst explains,
“Query logs often show not only the misspelling, but also the corrections that users make in subsequent queries. For example, if a searcher first types schwartzeneger and then corrects this to schwartzenegger, if the latter spelling is correct, an algorithm can make use of this pair for guessing the intended word. Experiments on algorithms that derive spelling corrections from query logs achieve results in the range of 88-90% accuracy for coverage of about 50% of misspellings.”
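The mechanism Hearst describes can be sketched as follows. This is a minimal illustration, not the algorithm used in the experiments she cites: the session data is invented, the function names (`edit_distance`, `mine_corrections`, `suggest`) are hypothetical, and treating any small-edit rewrite as a spelling fix is a simplifying assumption.

```python
from collections import Counter, defaultdict

def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]

def mine_corrections(sessions, max_distance=2):
    """Count rewrites where a query is followed by a near-identical query,
    on the assumption that such a rewrite is the user fixing a typo."""
    pair_counts = defaultdict(Counter)
    for queries in sessions:
        for first, second in zip(queries, queries[1:]):
            if first != second and edit_distance(first, second) <= max_distance:
                pair_counts[first][second] += 1
    return pair_counts

def suggest(query, pair_counts):
    """Return the most frequent rewrite of this query, or None if the logs
    contain no rewrite for it (the query is not covered)."""
    if query not in pair_counts:
        return None
    return pair_counts[query].most_common(1)[0][0]

# Invented query sessions, one list of consecutive queries per searcher.
sessions = [
    ["schwartzeneger", "schwartzenegger"],
    ["schwartzeneger", "schwartzenegger", "terminator"],
    ["ford", "ford mustang"],
]
pair_counts = mine_corrections(sessions)
print(suggest("schwartzeneger", pair_counts))
print(suggest("splaining", pair_counts))
```

The second lookup returns nothing because no searcher in these logs ever rewrote that query. That gap is what "coverage" measures: a log-derived corrector can only guess at misspellings it has seen someone fix.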
This quotation brings us back to my opening claim that it is (currently) impossible, with text, to achieve anything near the 100% definitional precision demanded by data quality purists. Ninety percent accuracy in about half of cases!? (Accuracy and quality are tightly linked.) That’s the world of text analytics, of text data quality. In some situations you’ll do better, sometimes much better; but in others, you’ll do worse.
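A rough calculation shows what those figures imply. It assumes the quoted 88-90% accuracy applies only within the roughly 50% of misspellings that are covered, which is one reading of the quotation rather than something it states outright. The function name is hypothetical.

```python
def correctly_fixed_share(accuracy, coverage):
    """Share of all misspellings that end up correctly fixed.

    Assumes accuracy is measured only over the covered misspellings.
    """
    return accuracy * coverage

low = correctly_fixed_share(0.88, 0.50)
high = correctly_fixed_share(0.90, 0.50)
print(f"{low:.0%} to {high:.0%} of all misspellings corrected")
```

On that reading, fewer than half of all misspelled queries end up with a correct suggestion.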
I’ll reiterate the four text data quality lessons I’ve offered and I’ll add a fifth lesson drawn from the accuracy picture.
Copyright 2004 — 2019. Powell Media, LLC. All rights reserved.
BeyeNETWORK™ is a trademark of Powell Media, LLC