Half a Loaf Information Systems

Originally published July 27, 2009

Does the expression “half a loaf is better than no bread” apply to information systems? Or, think of the old Orson Welles TV commercial, “We will sell no wine before its time.” For a 1.0 analytics-software release, when is it time? How good does a new release have to be? When it comes to text mining and analysis, what are the decision criteria?

I recently saw a demo of a new semantic annotation system. While the software was still a beta version, it appeared to be feature-complete and bug-free and capable of delivering potentially very useful semantic capabilities. Yet as-demoed, it had clear flaws. I saw design and implementation limitations that reduce accuracy (as measured by both precision and recall, by how exactly and how completely semantic annotations are applied). These limitations will diminish usability in ways that could most significantly affect the non-technical end users for whom the system was designed. Release as-is or reprogram? What would you do? I’ll provide concrete examples of ways this system (and others like it) could boost accuracy and usability, and then I’ll give you my call.
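For readers who want the two accuracy measures pinned down, here is a minimal sketch in Python. It assumes annotations can be compared as simple (text, type) pairs against a hand-built "gold" set; the function name and the sample pairs are made up for illustration and are not measurements from the demo.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) of predicted annotations against a gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: what share of the applied annotations are correct ("how exactly").
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: what share of the correct annotations were applied ("how completely").
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Made-up illustration: annotations as (text, type) pairs.
gold = {
    ("Steve Wozniak", "PERSON"),
    ("Apple Store", "ORGANIZATION"),
    ("iPhone", "PRODUCT"),
    ("MacBook", "PRODUCT"),
}
predicted = {
    ("Steve Wozniak", "PERSON"),
    ("Apple", "ORGANIZATION"),
    ("MacBook", "PERSON"),
}

precision, recall = precision_recall(predicted, gold)
```

In this invented case only one of three applied annotations is right and only one of four expected annotations is found, so both measures suffer at once.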

Semantic Enrichment

Semantic enrichment is a process of annotating text (or other data objects) – in practice, via tagging – in ways that boost the value of the text. Annotations applied at the document level might identify the author, title, publication date, and topics or themes in a fashion that enables machine classification of the documents and improves document findability for searchers. Call this type of annotation, whether automatically generated via text mining or human applied, “metadata.” Annotations applied at the feature level, to words or terms within the document, capture the meaning of those words and terms in machine-processable form. This latter type of semantic annotation can be used, for instance, to provide associated information or functions to users. For example, software could link to or provide a pop-up with an Oracle stock quote each time the ticker symbol ORCL is detected (and annotated) in a Web page.
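The ticker-symbol case can be sketched in a few lines of Python. This is only an illustration of lookup-based feature annotation, not how any particular product works; the `TICKERS` table and the `annotate_tickers` function are hypothetical names of my own.

```python
import re

# Hypothetical lookup table: ticker symbol -> company name (illustrative only).
TICKERS = {"ORCL": "Oracle", "AAPL": "Apple"}


def annotate_tickers(text, tickers=TICKERS):
    """Return (start, end, symbol, company) for each known ticker found in text."""
    # Match only whole words, so "ORCL" inside a longer token is ignored.
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, tickers)) + r")\b")
    return [
        (m.start(), m.end(), m.group(1), tickers[m.group(1)])
        for m in pattern.finditer(text)
    ]


sample = "Shares of ORCL rose while AAPL was flat."
annotations = annotate_tickers(sample)
```

Each tuple carries the character offsets of the match, which is what a front end would need in order to attach a link or a pop-up quote to that span of the page.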

How do we judge the correctness of an annotation in the face of ambiguity? Semantic annotation systems use a combination of lookups and statistical and linguistic pattern matching to discern meaning. While it’s not hard to tell a hawk from a handsaw, the clues that tell us whether “ford” indicates a president, an auto manufacturer, an actor, a theater, or a place to cross a river are often subtle.

Evaluating a Demo

The demo I recently saw correctly identified Wozniak (Steve, co-founder of Apple Computer – there, I’ve annotated his name for you in clear text) as a person, one whose name appeared in a couple of forms (“Wozniak” and “Steve Wozniak”) in an article that described his appearance at an Apple Store where he queued up to get a new iPhone 3G S. The vendor’s software did a credible job overall, yet it:

  1. Did not reduce multiple forms of a given name to a single version. When the product manager, as part of the demo, pulled up the annotations for “Wozniak,” almost all the information that came up was about a tennis player, Aleksandra Wozniak, who has been cited extensively in the press lately.

  2. Did not recognize “Apple Store” as an entity in itself, distinct from “Apple.” Even if “Apple Store” wasn’t identified via a named-entity lookup in a competitive-intelligence lexicon or taxonomy, the capitalization of “Store” is a clear clue that there’s more to the entity in question than the very ambiguous “Apple.” Similarly, the software separately annotated “San Jose, California” as “San Jose” and “California.” While California seems relatively unambiguous, how many organizations have “California” as part of their names? And how many “San Jose” cities are in this world? By contrast, “San Jose, California” is completely unambiguous. “San Francisco County” and “Apple Computer Inc.” – in both cases, the last word was excluded from the annotations – were other examples of partial annotation of a larger term in the demo.

  3. In a supplementary document linked to the annotated word “Apple,” incorrectly annotated only a non-existent “Exchange Commission,” ignoring the preceding “Securities And,” which was on a separate line in the source PDF file.

  4. Did not recognize “iPhone 3G S” or even just “iPhone” as a product – one that gets lots of attention nowadays – but did recognize “MacBook” as a person name, I suspect based on a “Mac [capitalized word]” pattern. It similarly did not recognize, in parsing the SEC document, the very clear and distinctive reference, “California Business & Professional Code §16700 et seq.,” which should be annotated by any system that sees fit to bring in documents of that nature. (I don’t know of any context other than legal reference where the section symbol “§” is used.)

  5. Runs in only one Web browser for now – support for others is planned – and does so via a browser alteration that could prove intrusive for casual users who use the software infrequently and might well be unacceptable in many government and corporate settings. The alternative would be a Java or other applet that runs in-page rather than via a browser alteration.

The vendor could improve its new semantic-annotation software. Taking my examples above in order, the vendor could:

  1. Implement “term reduction” techniques that reduce variant representations of a given feature to a single, canonical form. Given the co-occurrence of “Wozniak” and “Steve Wozniak” in the Web article parsed in the demo (and no mention of tennis), “Wozniak” could and should have been annotated with Woz’s full name.

  2. Program “greedy” (or maximal) pattern matching that preferentially annotates the longest feature found in a sequence of words, e.g., “Apple Store” rather than “Apple,” and that even applies multiple annotations to compound features composed of distinctly recognizable terms, that is, annotates “San Jose” and “California” AND “San Jose, California.” I actually see a second route to improvement given the current annotation approach. When the user right-clicks on (for example) “Wozniak,” pass the whole page and not just the single term back to the server, which would use the whole of the page’s content to disambiguate the term and provide the appropriate links and supplemental information. An article that is clearly about Apple and that in other occurrences associates a particular first name with “Wozniak” provides plenty of contextual clues that a good semantics vendor – and this one does belong to that category – will be able to use.

  3. Better handle layout issues. “Securities and Exchange Commission,” despite the line break, was clearly part of a document title that was in a distinct font and offset from surrounding text.

  4. Either a) better train the software, b) make vocabulary and other limitations clear, or c) not harvest materials like the SEC document that the software can’t completely parse and analyze. In the vendor’s defense, it was just a demo that I saw. The vendor’s software is open and extensible. Implementers (as opposed to end users) can add their own vocabularies and pattern-match rules.

  5. Consider potential deployment issues in light of end users’ work environments.
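The first two suggestions are easy to sketch in Python. What follows is a toy illustration under stated assumptions, not the vendor’s method: the `LEXICON` table, the entity types, and the function names are all hypothetical, and a real system would combine lookups like these with statistical and linguistic pattern matching.

```python
# Hypothetical lexicon: surface form -> entity type (illustrative only).
LEXICON = {
    "Apple": "ORGANIZATION",
    "Apple Store": "RETAIL_LOCATION",
    "San Jose": "CITY",
    "California": "STATE",
    "San Jose, California": "CITY",
    "Steve Wozniak": "PERSON",
}


def _on_word_boundaries(text, start, end):
    """True if text[start:end] is not embedded in a longer alphanumeric token."""
    before_ok = start == 0 or not text[start - 1].isalnum()
    after_ok = end == len(text) or not text[end].isalnum()
    return before_ok and after_ok


def longest_match_annotate(text, lexicon=LEXICON):
    """Scan left to right; at each position keep only the longest matching entry."""
    entries = sorted(lexicon, key=len, reverse=True)  # try longest entries first
    annotations, i = [], 0
    while i < len(text):
        for entry in entries:
            end = i + len(entry)
            if text.startswith(entry, i) and _on_word_boundaries(text, i, end):
                annotations.append((i, end, entry, lexicon[entry]))
                i = end  # resume scanning after the annotated span
                break
        else:
            i += 1  # no entry starts here; move on one character
    return annotations


def reduce_person_mentions(mentions):
    """Map a bare surname to a full name when exactly one co-occurring full name ends with it."""
    full_names = {m for m in mentions if len(m.split()) > 1}
    canonical = {}
    for mention in set(mentions):
        candidates = [f for f in full_names if f.split()[-1] == mention]
        # An ambiguous or unmatched mention is left as it was found.
        canonical[mention] = candidates[0] if len(candidates) == 1 else mention
    return canonical


text = "Wozniak queued at the Apple Store in San Jose, California."
found = [entry for _, _, entry, _ in longest_match_annotate(text)]
names = reduce_person_mentions(["Wozniak", "Steve Wozniak"])
```

With this lexicon, the sample sentence yields “Apple Store” and “San Jose, California” rather than their shorter parts, and a bare “Wozniak” reduces to “Steve Wozniak” only when no second Wozniak appears in the same document. The sketch keeps just the longest match; annotating “San Jose” and “California” as well as the compound would mean recording overlapping spans instead of skipping past them.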

By the way, I’ll mention one other possible issue exposed by the semantic-annotation demo. The software annotated everything on the Web page: the Apple Store/Wozniak article and also the page’s navigation elements and even the contextual advertising. I’m undecided whether the peripheral text should have been annotated, although I will say that I would see search-engine indexing of ads as an information-retrieval data-quality problem.

Is Half a Loaf Better than None?

So is half a figurative loaf better than no bread when we’re talking about information systems? My initial reaction, when I saw the semantic-annotation demo, was a clear No. Now I’m not so sure.

Yes, given possible – and feasible – improvements, another six (or so) months’ development, with feature and usability enhancements and perhaps even a rethinking of basic design points, would be in order. It’s not as if the vendor, so far as I can tell, is facing competitive pressures that are forcing premature release.

On the other hand, I can think of another analytics vendor that took two years to move a major – dot-zero – release from initial, limited roll-out to general availability. Agile methodologies for software development say you should release early and often with a process of continual improvement. (My counter-example vendor is large, very well known, and anything but agile.) Get your software out there. Let users bang on it and tell you what parts are great (and could be even better) and what elements don’t make the grade. And the validity of a technical critique doesn’t make an industry insider (like me) right when it comes to product management and marketing, as I have noted in the past.

My conclusion: When it comes to the launch of new capabilities in a fast-moving field – and no information-systems domain is evolving more rapidly than semantics – it’s desirable, even important, to get the technology out there where it can be seen and reacted to and in turn spur further innovation. Text is a tough information source; mining and semantic enrichment will never provide exact results. We can live with less than perfect accuracy. Even a half step forward brings us closer to our goals for analytics.


  • Seth Grimes

    Seth is a business intelligence and decision systems expert. He is founding chair of the Text Analytics Summit and principal consultant at Washington, D.C., based Alta Plana Corporation. Seth consults, writes, and speaks on information-systems strategy, data management and analysis systems, IT industry trends, and emerging analytical technologies. Seth chairs the Sentiment Analysis Symposium and the Text Analytics Summit.

