Originally published July 27, 2009
Does the expression “half a loaf is better than no bread” apply to information systems? Or, think of the old Orson Welles TV commercial, “We will sell no wine before its time.” For a 1.0 analytics-software release, when is it time? How good does a new release have to be? When it comes to text mining and analysis, what are the decision criteria?
I recently saw a demo of a new semantic annotation system. While the software was still a beta version, it appeared to be feature-complete and bug-free and capable of delivering potentially very useful semantic capabilities. Yet as-demoed, it had clear flaws. I saw design and implementation limitations that reduce accuracy (as measured by both precision and recall, by how exactly and how completely semantic annotations are applied). These limitations will diminish usability in ways that could most significantly affect the non-technical end users for whom the system was designed. Release as-is or reprogram? What would you do? I’ll provide concrete examples of ways this system (and others like it) could boost accuracy and usability, and then I’ll give you my call.
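For readers who want the two accuracy measures pinned down, here is a minimal sketch of how precision (how exactly annotations are applied) and recall (how completely) might be computed against a hand-labeled reference set. The function name, the labels, and the sample data are all assumptions invented for illustration; none of them come from the vendor's system or from the demo.

```python
def precision_recall(predicted, gold):
    """Compare predicted annotations against a hand-labeled gold set.

    Both arguments are collections of (term, label) pairs.
    Precision: share of predicted annotations that are correct.
    Recall: share of gold annotations that the system found.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical data, made up for this example only.
gold = {
    ("Steve Wozniak", "PERSON"),
    ("Apple", "COMPANY"),
    ("iPhone", "PRODUCT"),
    ("Apple Store", "PLACE"),
}
predicted = {
    ("Steve Wozniak", "PERSON"),
    ("Apple", "FRUIT"),      # wrong label: hurts precision
    ("iPhone", "PRODUCT"),
}                            # "Apple Store" missed: hurts recall

precision, recall = precision_recall(predicted, gold)
```

In this made-up case two of three predicted annotations are correct and two of four reference annotations are found, so a system can look fairly exact while still missing half of what it should have tagged.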
Semantic enrichment is a process of annotating text (or other data objects) – in practice, via tagging – in ways that boost the value of the text. Annotations applied at the document level might identify the author, title, publication date, and topics or themes in a fashion that enables machine classification of the documents and improves document findability for searchers. Call this type of annotation, whether automatically generated via text mining or human applied, “metadata.” Annotations applied at the feature level, to words or terms within the document, capture the meaning of those words and terms in machine-processable form. This latter type of semantic annotation can be used, for instance, to provide associated information or functions to users. For example, software could link to or provide a pop-up with an Oracle stock quote each time the ticker symbol ORCL is detected (and annotated) in a Web page.
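The ticker-symbol case can be sketched in a few lines. This is a simplified illustration, assuming a tiny hard-coded lookup table (`TICKERS`) and a hypothetical annotation record format; a production system would rely on a maintained symbol list and far more context checking.

```python
import re

# Hypothetical lookup table; assumed for illustration only.
TICKERS = {"ORCL": "Oracle Corporation"}


def annotate_tickers(text, tickers=TICKERS):
    """Return feature-level annotations for known ticker symbols in text."""
    annotations = []
    # Candidate symbols: standalone runs of one to five capital letters.
    for match in re.finditer(r"\b[A-Z]{1,5}\b", text):
        symbol = match.group()
        if symbol in tickers:
            annotations.append({
                "start": match.start(),   # character offset where the term begins
                "end": match.end(),       # offset just past the term
                "text": symbol,
                "type": "ticker",
                "company": tickers[symbol],
            })
    return annotations


sample = "Shares of ORCL rose after the earnings call."
annotations = annotate_tickers(sample)
```

Each annotation records where the term sits in the text and what it means, which is the information a page would need in order to attach a stock-quote link or pop-up to that span.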
How do we judge the correctness of an annotation in the face of ambiguity? Semantic annotation systems use a combination of lookups and statistical and linguistic pattern matching to discern meaning. While it’s not hard to tell a hawk from a handsaw, the clues that tell us whether “ford” indicates a president, an auto manufacturer, an actor, a theater, or a place to cross a river are often subtle.
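To make the "ford" problem concrete, here is a deliberately naive sketch of lookup-based disambiguation: count how many clue words for each candidate sense appear near the term and pick the best-scoring sense. The clue lists and the function name are assumptions invented for this example; real systems combine lookups like this with statistical and linguistic pattern matching.

```python
import re

# Hypothetical clue words per sense, invented for illustration.
SENSE_CLUES = {
    "president": {"president", "gerald", "nixon", "pardon"},
    "automaker": {"car", "cars", "truck", "motor", "mustang", "automaker"},
    "actor": {"harrison", "film", "movie", "starred", "actor"},
    "theater": {"theater", "theatre", "lincoln", "booth", "play"},
    "river crossing": {"river", "stream", "cross", "shallow", "wade"},
}


def disambiguate(text, clues=SENSE_CLUES):
    """Pick the sense whose clue words overlap most with the text.

    Returns None when no clue word appears, rather than guessing.
    Ties go to whichever sense is listed first in the clue table.
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {sense: len(words & vocab) for sense, vocab in clues.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


actor_sense = disambiguate("Harrison Ford starred in the film.")
river_sense = disambiguate("We had to ford the shallow river.")
unknown_sense = disambiguate("Ford said hello.")
```

The third call returns `None` because the sentence offers no clue at all, which is exactly the situation where an annotation system has to choose between guessing and staying silent.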
The demo I recently saw correctly identified Wozniak (Steve, co-founder of Apple Computer – there, I’ve annotated his name for you in clear text) as a person, a person whose name appeared in a couple of forms (“Wozniak” and “Steve Wozniak”) in an article that described his appearance at an Apple Store where he queued up to get a new iPhone 3G S. The vendor’s software did a credible job overall, yet it:
The vendor could improve its new semantic-annotation software. Taking my examples above in order, the vendor could:
By the way, I’ll mention one other possible issue exposed by the semantic-annotation demo. The software annotated everything on the Web page: the Apple Store/Wozniak article and also the page’s navigation elements and even the contextual advertising. I’m undecided whether the peripheral text should have been annotated, although I will say that I would see search-engine indexing of ads as an information-retrieval data-quality problem.
So is half a figurative loaf better than no bread when we’re talking about information systems? My initial reaction, when I saw the semantic-annotation demo, was a clear No. Now I’m not so sure.
Yes, given possible – and feasible – improvements, another six (or so) months’ development, with feature and usability enhancements and perhaps even a rethinking of basic design points, would be in order. It’s not as if the vendor, so far as I can tell, is facing competitive pressures that are forcing premature release.
On the other hand, I can think of another analytics vendor that took two years to move a major – dot-zero – release from initial, limited roll-out to general availability. Agile methodologies for software development say you should release early and often with a process of continual improvement. (My counter-example vendor is large, very-well-known, and anything but agile.) Get your software out there. Let users bang on it and tell you what parts are great (and could be even better) and what elements don’t make the grade. And the validity of a technical critique doesn’t make an industry insider (like me) right when it comes to product management and marketing, as I have noted in the past.
My conclusion: When it comes to the launch of new capabilities in a fast-moving field – and no information-systems domain is evolving more rapidly than semantics – it’s desirable, even important, to get the technology out there where it can be seen and reacted to and in turn spur further innovation. Text is a tough information source; mining and semantic enrichment will never provide exact results. We can live with less-than-perfect accuracy. Even a half step forward brings us closer to our goals for analytics.
SOURCE: Half a Loaf Information Systems
Recent articles by Seth Grimes