Blog: Seth Grimes

Seth Grimes

Welcome to my BeyeNETWORK Blog, which will focus on text analytics and other matters related to making sense of unstructured information sources in support of better enterprise decision making.

About the author

Seth is a business intelligence and decision systems expert. He is principal consultant at the Washington, D.C.-based Alta Plana Corporation and chairs both the Text Analytics Summit, which he founded, and the Sentiment Analysis Symposium. Seth consults, writes, and speaks on information-systems strategy, data management and analysis systems, IT industry trends, and emerging analytical technologies.

Editor’s Note: More articles and resources are available in Seth's BeyeNETWORK Expert Channel. Be sure to visit today!

July 2009 Archives

I recently received an inquiry from a student at a European management school who is writing a thesis about the relationship between search technology and business intelligence. She sees the two technologies as meeting at text analytics and asked to put a few questions to me on the topic. Many folks share her interest, so my BeyeNETWORK blog seemed like a great place to share my responses. Here goes!

Management student> I have been struggling to differentiate some terms and to understand them more clearly, so my questions relate to that confusion. I would also like to hear your opinion on these two technologies (BI as software and enterprise search) and their uses of text analytics.

MS> What is the difference between text analytics and text mining? Is it related to structured vs. unstructured data? Or is text mining a subset of text analytics?

Seth> There isn't a significant difference. I find that "text mining" is the term favored in fields that have applied the technology longest and that also practice data mining; examples include life sciences and intelligence (e.g., counter-terrorism). "Text analytics" is more often used in business.

MS> Is content analysis the same as text analysis (if we look at textual documents, not rich data)?

Seth> To me, "content" generally indicates managed information that is typically found in a repository and is often published on the Web. In that sense, e-mail and IM messages, survey responses, contact center notes and transcripts, and other forms of text generated during business operations are not content, and content analysis that concerns text is a subset of text analysis.

But "content" does also cover video, audio, and other media as you note. Content analysis would include these forms where text analysis wouldn't, as you understand, beyond work with textual tags.

MS> Is there a difference between text analytics done by search technology and BI applications?

Seth> Text analytics that backs up search is meant to support information retrieval: indexing, summarizing, and ranking documents in response to a search query. TA enables semantic indexing by topics, themes, and relationships in order to go beyond indexing based solely on keywords. TA in support of search can also enable smarter, natural-language query processing. For example, you can enter "map oslo" in Google and get a map of Oslo, because Google combines named entity recognition for the geographic area, Oslo, with pattern matching that understands a query of the form "map X" as a request for a map.
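To make the "map oslo" idea concrete, here is a minimal sketch of that kind of query interpretation: a pattern that recognizes the "map X" query form, combined with a lookup against a tiny gazetteer standing in for named entity recognition. The function and variable names are my own illustration; a real search engine would use trained NER models and far richer query grammars, not a hand-written regex.

```python
import re

# Toy gazetteer standing in for a real named-entity recognizer.
GAZETTEER = {"oslo": "Oslo", "paris": "Paris", "bergen": "Bergen"}

# Pattern capturing the "map <place>" query form described above.
MAP_QUERY = re.compile(r"^map\s+(?P<place>.+)$", re.IGNORECASE)

def interpret_query(query: str):
    """Return an (intent, entity) pair, or None if no pattern matches."""
    m = MAP_QUERY.match(query.strip())
    if m:
        place = m.group("place").strip().lower()
        if place in GAZETTEER:  # named-entity lookup
            return ("show_map", GAZETTEER[place])
    return None

print(interpret_query("map oslo"))    # ('show_map', 'Oslo')
print(interpret_query("maple syrup")) # None -- "maple" is not a map request
```

The point of the sketch is the division of labor: pattern matching detects the *intent* (a map request) while entity recognition resolves the *argument* (the place named Oslo), and only the combination turns a keyword string into an answerable question.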

TA in BI (outside the use of search for BI) is different. A complete definition of BI includes treatment of information in textual and other forms, in databases, in repositories, and on the Web. Search is a BI tool, and so is information extraction into structured databases (a text analytics technique; information = entities, facts, topics, themes, etc.) -- some see IE from text as the equivalent of ETL for traditional databases -- and so is analysis, in the data mining sense, of text-extracted information. So when, for instance, you visualize a relationship network of people, companies, etc., based on text-extracted named entities and links (relationships, events, etc.), that's TA at work for BI.
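The IE-as-ETL idea above can be sketched in a few lines: extract structured rows (entity, relation, entity) from raw text, then load them into a structure you can analyze as a network. This is a deliberately simplified illustration with my own toy pattern and function names; production information extraction relies on trained statistical models, not a single regex.

```python
import re
from collections import defaultdict

# Toy pattern for "<Person> works for <Company>." statements -- a stand-in
# for the extraction step that a real IE system performs with trained models.
WORKS_FOR = re.compile(r"([A-Z][a-z]+ [A-Z][a-z]+) works for ([A-Z][A-Za-z ]+?)\.")

def extract_links(text):
    """The ETL-like step: pull (entity, relation, entity) rows from raw text."""
    return [(person, "works_for", company)
            for person, company in WORKS_FOR.findall(text)]

def build_network(triples):
    """Load extracted rows into an adjacency map for relationship analysis."""
    graph = defaultdict(set)
    for subj, _rel, obj in triples:
        graph[subj].add(obj)
    return dict(graph)

text = ("Alice Smith works for Acme Corp. "
        "Bob Jones works for Acme Corp. "
        "Carol White works for Widgets Inc.")
rows = extract_links(text)
print(build_network(rows))
```

Once text has been reduced to rows like these, conventional BI machinery -- databases, data mining, network visualization -- takes over, which is exactly the sense in which IE from text parallels ETL for structured sources.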

MS> What fields use text analytics the most? (Any industries in particular?)

Seth> Life sciences and intelligence (including counter-terrorism) were the earliest use cases, with serious work going back to the late '90s, and they're still very strong domains for TA. But now we're seeing use in a spectrum of business applications as well.

For this question and the next, let me also refer you to a report I recently published, which you can download for free at .

MS> How would you describe the text analytics market?

Seth> In my paper, I estimate the 2008 diversified, global market for text-analytics software and vendor-provided professional services at $350 million, representing 40% growth from 2007. I foresee sustained growth of up to 25% for 2009.

MS> There is a lot of talk about eDiscovery, where text analytics plays a crucial role, but it is also one of the main markets for search technology. Are these two technologies (is it OK to call text analytics a technology?) coming together?

Seth> I believe that in e-discovery, the principal application of TA is (still) in support of search in the sense I described above: creating richer indexes that allow legal researchers (litigants) to respond faster and more comprehensively to discovery mandates. TA is only starting to be used by legal professionals for investigatory purposes -- for what you could call "making the case." Compliance and fraud investigations, and risk management, are starting points for this type of use. I don't think the technology is being used systematically by litigators yet, but I do think we'll see a lot more of this investigatory use.

I hope you've found our Q&A useful! As always, if you have questions or comments, do get in touch.

Posted July 30, 2009 1:13 PM

Covering text analytics software, market, conference, and other news and developments to help KDnuggets readers better understand advances in Knowledge Discovery in Text...


Orchestr8 released on June 18 a significant upgrade to its AlchemyAPI content analysis online service. According to the company, the update includes expanded language coverage (adding Portuguese and Swedish), enhanced text categorization, and integration with Linked Data standards. "AlchemyAPI is a web-based service that enriches a publisher's content through automated tagging, categorization, and semantic analysis available as both a free online API and commercial subscription service."

Attensity Group announced on July 8 the availability of its new, hosted Survey Advantage service at a $5,000 per month point of entry. Attensity Survey Advantage "enables departments within large organizations and government agencies to measure, chart and understand customer sentiment and top issues expressed in customer feedback surveys."


Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit by Steven Bird, Ewan Klein, and Edward Loper was published in June 2009. "This book offers a highly accessible introduction to Natural Language Processing, the field that underpins a variety of language technologies ranging from predictive text and email filtering to automatic summarization and translation. You'll learn how to write Python programs to analyze the structure and meaning of texts, drawing on techniques from the fields of linguistics and artificial intelligence." Visit O'Reilly for information.


A joint conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing will be held August 2-7, 2009 in Singapore. ACL-IJCNLP 2009 will cover a broad spectrum of technical areas related to natural language and computation.

The 2009 conference of the German Society for Computational Linguistics and Language Technology (GSCL) will include a workshop on the Unstructured Information Management Architecture (UIMA), September 30, 2009, in Potsdam, Germany. "Participants are invited to present applications realized using UIMA, general experiences using UIMA as a platform for natural language processing, as well as technical papers on particular aspects of the UIMA framework. Alternatives to and comparisons of other frameworks - e.g. GATE, LingPipe, etc. - with UIMA are of interest, too."

The third IEEE International Conference on Semantic Computing is slated to be held September 14-16, 2009 in Berkeley, California. ICSC 2009 is "an international forum for researchers and practitioners to present research that advances the state of the art and practice of Semantic Computing, as well as identifying the emerging research topics and defining the future of the field."

Recent Advances in Natural Language Processing RANLP 2009 is slated for September 14-16, 2009 in Borovets, Bulgaria, preceded by September 12-13 tutorials and followed by associated workshops September 17-18.

The ACM Eighteenth Conference on Information and Knowledge Management (CIKM 2009) will take place in Hong Kong, November 2-6, 2009. The conference is sponsored by ACM SIGIR and SIGWEB.

Language and Technology Conference 2009: Human Language Technologies as a Challenge for Computer Science and Linguistics (LTC 2009) will take place November 6-8 in Poznan, Poland. "Human Language Technologies (HLT) continue to be a challenge for computer science, linguistics and related fields as these areas become an ever more essential element of our everyday technological environment... [creating] a favorable climate for the intensive exchange of novel ideas, concepts and solutions across initially distant disciplines."

Text Analysis Conference (TAC 2009) workshops will be held November 16-17, 2009 at the National Institute of Standards and Technology in Gaithersburg, Maryland, co-located with the Text REtrieval Conference (TREC), November 17-20, 2009.

Mining User-Generated Content for Security (MINUCS 2009) will take place December 9, 2009, in Venice, Italy, co-located with the First International Conference on User Centric Media (UCMedia 2009), December 9-11, 2009. "The aim of this workshop is to bring together researchers from academia and industry who develop technologies for mining open-source user-generated textual data on the Web, as well as end-users interested in exploiting such technologies for knowledge discovery. The emphasis is placed on large-scale text mining systems..."

Posted July 30, 2009 7:10 AM