Perspectives on Text Analytics in 2009

Originally published February 24, 2009

You’re reading this article because you’re actively interested in text analytics technology and market trends. Whether you’re a solutions provider or consumer, you want to be sure you’ve invested your time and financial resources wisely.

The picture is complex, however, as the text analytics domain is not dominated by any one algorithm or approach, vendor or business sector. So far, the diversity of options has benefited potential users; awareness (and realization) of the technology’s capabilities is building rapidly in areas such as media monitoring, publishing, customer-experience management and semantic search. Meanwhile, growth remains strong in domains that have long applied text analytics, including life sciences, intelligence and law enforcement, and financial services. While the text analytics domain isn’t immune to current business conditions, economic pressures could actually spur uptake, motivated by a quest for efficiency through automation.

Sources and Perspectives

To stay on top of trends, I try to talk to current and prospective users as often as I can, and I also try to keep up with researchers and fellow analysts. (You’ll find links for some of the folks I follow at my BeyeNETWORK channel.) I also catch up with vendors periodically.

Because it’s helpful to understand current industry perspectives, I recently invited CEOs, CTOs and thought leaders to respond to the following query:

What do you see as the 3 (or fewer) most important text analytics technology, solution or market challenges in 2009?

Before relaying the answers I received, I’ll mention two meta-replies. First, Marti Hearst, associate professor in the UC Berkeley School of Information, declined to respond, in part because “…for me, 2009 is too close a time horizon.” This thought applies not only to academic researchers but also to anyone who’s thinking strategically, beyond this year’s crop of product releases and business conditions. Next, Lexalytics CEO Jeff Catlin candidly characterized his response as “somewhat self-serving thoughts.” No doubt many of the vendors quoted see the greatest challenges as residing in areas their tools address, so take their views with a grain of salt.

I’ll relay all views with minimal editing, complete with vendor self-promotion because, after all, commercial products are “where the rubber hits the road” for most enterprise users.

Text Analytics in 2009

Let’s start with Maria Milosavljevic, CTO at Capital Markets Cooperative Research Centre, an Australian group that is conducting leading-edge R&D in real-time security-market surveillance. The top 3 for Maria are:

  • Expectations: Unfortunately I think that the most difficult challenges of the past are still with us, and this is the biggest. People either believe that not much at all is possible, or they believe that more is possible than is realistic. It is never the case that end users have realistic expectations.

  • Portability: Text analytics systems trained on a particular type of input data typically do not transfer well to other types of input. Cross-domain shifts (e.g., from news to medical text) raise much the same problems as shifts within a domain (e.g., from medical news to medical journals). [A toy illustration of this problem follows the list.]

  • Data quality: Data is never clean. Garbage in = garbage out!
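
Maria’s portability point is easy to demonstrate in miniature. What follows is a hypothetical sketch of my own, not drawn from any vendor quoted here: a bag-of-words classifier fitted to news-style wording loses its footing when the same task is posed in journal-style wording. The sentences and labels are invented, and with so little data the exact scores are noise, but the vocabulary mismatch is the same one that defeats cross-domain transfer at scale. The sketch assumes scikit-learn is installed.

    # Toy illustration of the domain-portability problem: a classifier fit
    # to one domain's vocabulary degrades when the wording shifts, even
    # though the underlying task is unchanged. All sentences are invented.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Training domain: consumer health news (1 = drug-safety story, 0 = other).
    news_texts = [
        "FDA pulls popular painkiller over heart attack risk",
        "New study links diet soda to weight gain",
        "Recall issued after blood thinner causes severe bleeding",
        "Hospital wait times grow as flu season peaks",
        "Regulators warn that sleeping pills may impair driving",
        "Local gym offers free classes for seniors",
    ]
    news_labels = [1, 0, 1, 0, 1, 0]

    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(news_texts, news_labels)

    # Same task, journal-style phrasing: the vocabulary ("adverse events",
    # "contraindicated") barely overlaps with the news corpus, so the model
    # has almost no learned features to draw on.
    journal_texts = [
        "Adverse cardiovascular events observed in the treatment cohort",
        "Patient admission rates correlate with seasonal influenza",
        "Anticoagulant therapy contraindicated due to hemorrhagic risk",
    ]
    journal_labels = [1, 0, 1]

    print("in-domain accuracy:   ", model.score(news_texts, news_labels))
    print("cross-domain accuracy:", model.score(journal_texts, journal_labels))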

Breck Baldwin of Alias-i, author of LingPipe natural language processing software, is very to-the-point in his reply, with a sort of twist on Maria’s expectations response.

#1 thing the field needs is a profound and real success story of text analytics.

The reply from Sid Banerjee, CEO of Clarabridge, looks at factors related to scaling text analytics to the enterprise, “the expansion of text analytics across a few dimensions.” His reply:

  • From a functional to an enterprise imperative. In 2006-2008, text analytics was deployed to marketing, call center, and survey groups. Over the past few years, solutions have morphed from those that serve one group to offerings that support multiple groups, and now customers are looking, on day one, to pick solutions that can demonstrate cross-functional value. The selling cycles are more complex, the solutions need to show more a priori business value and relevance, and at the same time, because the solutions are not deployed in just one area, IT wants to know more about how the solution is going to fit with “enterprise” standards. In total: selling is now more complex and more constituencies are involved in the decision, but at the same time articulating and demonstrating business value is even more important.

  • From an isolated to an integrated solution. Toward the end of 2008, we started seeing more interest from companies and partners looking to seamlessly integrate the results of text analytics back into operational systems, i.e., process customer verbatims and merge the categorized, scored results back into call center applications. This new use cycle – not just for analytics, but for operational integration – provides an interesting opportunity for text analytics vendors to consider whether to be a stand-alone application or to be more tightly integrated with partner products and offerings.

  • From a static to a scalable architecture. The big story in 2008 was scale. Data volumes are going up. Usage requirements are going up. There’s no reason to expect scalability requirements won’t continue to grow in 2009 and beyond. The winners will be those who can see beyond today’s data and user volumes and design their offerings for an order of magnitude or two more, anticipating this certain future.

Aaron B. Brown of IBM similarly looks at architectural issues and also at business concerns. Aaron is Program Director, ECM Discovery, for IBM Information Management Software and has great market and technical insights. I interviewed him last year on “Text Analytics for Legal Compliance.” That exchange is still a good read. Here are his thoughts on challenges for 2009:

  • Defining the business case for text analytics. In the current economic situation, organizations are clamping down on new projects and more than ever looking for hard ROI savings to justify investment. To pass the funding bar, text analytics solutions, which typically fall in the category of new projects undertaken for business optimization, need to come with solid business cases that demonstrate hard-dollar operational savings based on proven examples. Given the emerging nature of many text analytics solution areas, this will be a challenge to growth in 2009.

  • Creating fully interactive exploratory analytics. Most current text analytics approaches rely on extensive design-time configuration and customization. This has limited their applicability to use cases where the categories of extracted entities, etc. are understood up front. An emerging class of investigation-centric applications, such as eDiscovery, requires text analytics that can be rapidly and seamlessly reconfigured by business analysts as they explore data sets and interactively refine their understanding of what needs to be extracted. Further, these applications add pressure to democratize text analytics, as the business user can no longer wait for a linguistic or data processing expert to reconfigure the analytics each time a new insight is reached. The new challenge for text analytics is to enable extraction and analysis that are near instantaneous and at the same time reconfigurable at interactive speeds by a business user without specialized text or linguistic expertise.

  • Expanding mainstream use cases for text analytics. To date, text analytics has had strong traction in certain niche solution markets. However, it has yet to be widely adopted as a horizontal capability powering broad business optimization use cases in content-centric (e.g., ECM), process-centric (e.g., BPM) or data-centric (e.g., BI) applications. As the market moves toward mainstream usage, text analytics technologies (and their relatives in text classification and search) will need to integrate more tightly into information and process management platforms – potentially being subsumed into them as native capability – and solutions will shift to leveraging analytics in conjunction with broader information management and technologies. 2009 will be a critical year for text analytics to start making this transition.

Other replies also looked at business conditions and at experiences in the adoption and use of text technologies. Lexalytics CEO Jeff Catlin foresees:

  • The poor economy will shift the sale of text analytics to larger companies. We're already seeing this in our sales, which used to be about two-thirds small companies and one-third large companies. Now it's mostly larger companies. The upside of this is that big companies are spending on search and text analytics as they seek to save money and increase efficiency due to lower staffing levels.

  • Technical push: Our push this year is going to be on empowering customers with easy-to-use linguistic tools that will allow users to build and deploy some of their own text analytics bits and pieces using our core frameworks. The first of these for us is a user-driven entity recognizer that will allow users to mark up domain-specific text (let’s say medical text) with entities (diseases); after marking up a hundred or so stories, the system will build a “model,” or recognizer, for that type of entity so that it can deduce other diseases from how they’re described in the text. We’ll be releasing our first version of this by late February/early March and hope to enhance the tool over the course of the year to further empower users. Our initial market research indicates that publishers/media would find such tools very valuable because of the amount of time/money they spend on maintaining lists. [A hypothetical sketch of this style of user-trained recognizer follows the list.]

  • I expect the number of vendors to contract as those with weaker or narrower offerings find it increasingly difficult to sell in this very challenging environment. To make a go of it in 2009, it seems that at a minimum, vendors need to provide: entity extraction, concept extraction and possibly sentiment analysis.
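
Jeff describes the user-driven entity recognizer only at a high level, so here is my own hypothetical sketch of the general technique, not Lexalytics’ implementation: learn from user-marked examples which neighboring words signal an entity, then flag unseen words that appear in similar contexts. Real systems would use richer features and statistical models; this toy version just counts context words.

    # Hypothetical sketch of a user-trained entity recognizer: users mark
    # entities inline with [brackets]; the system learns which neighboring
    # words signal an entity and proposes new candidates it has never seen.
    import re
    from collections import Counter

    def train(marked_docs, window=2):
        """Count the words that appear near user-marked [entities]."""
        context_counts = Counter()
        vocab = set()
        for doc in marked_docs:
            tokens = re.findall(r"\[[^\]]+\]|\w+", doc.lower())
            vocab.update(t for t in tokens if not t.startswith("["))
            for i, tok in enumerate(tokens):
                if tok.startswith("["):
                    lo, hi = max(0, i - window), i + window + 1
                    context_counts.update(t for t in tokens[lo:hi]
                                          if not t.startswith("["))
        return context_counts, vocab

    def suggest(doc, context_counts, vocab, threshold=3):
        """Flag unseen words whose neighbors resemble learned entity contexts."""
        tokens = re.findall(r"\w+", doc.lower())
        found = set()
        for i, tok in enumerate(tokens):
            neighbors = tokens[max(0, i - 2):i] + tokens[i + 1:i + 3]
            score = sum(context_counts[n] for n in neighbors)
            if score >= threshold and tok not in vocab:
                found.add(tok)
        return found

    # A few user-annotated medical sentences (invented examples).
    training = [
        "patient was diagnosed with [diabetes] last spring",
        "symptoms of [asthma] worsened after exposure",
        "she was diagnosed with [hypertension] in 2007",
    ]
    contexts, vocab = train(training)
    # "lupus" never appears in training, but it sits in a learned context.
    print(suggest("he was diagnosed with lupus yesterday", contexts, vocab))

After a hundred marked-up stories rather than three, the context statistics become dense enough to be genuinely useful, which matches the scale Jeff mentions.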

Craig Norris, CEO at Attensity, similarly noted three varieties of challenge, in his case relating to technology, implementation and seeing opportunity in adverse business conditions:

  • Getting sentiment right: Many of the vendors in the text-analytics space have pretty good text classification technology, where the application classifies terms based on dictionary entries or predefined lists. The output looks good, but because only terms are examined, there tend to be a lot of false positives: negations are missed, and relationships (such as why someone is unhappy) are never found. Many customers have purchased technology that only goes this far. They will be challenged this year by consumers of the data who find that the data is, in fact, wrong. Natural language processing (NLP) technology like Attensity’s not only finds sentiment; it can find sentiment in context, determining accurately whether it’s negative or positive, the degree [or intensity] of it, and the “why” behind it. [A minimal illustration of the negation problem follows the list.]

  • Getting to Action: Another challenge this year will be around getting past high-level views of the data (general ratings of sentiment, general views of issues) to the root cause so that users can take action and companies can get real value. For example, knowing not only that a customer is unhappy with a product, but also that they intend to return it if they don't get a call back or a fix for their problem is critical to being able to remedy a situation and save a customer. Being able to know this is where the real ROI is for text analytics products. This is only possible with an approach that not only finds sentiment but can also understand the relationships between the issues and the reasons for the issues.

  • The economy: Certainly the economy will be a challenge for any technology solution vendor this year. Our offering has proven to enable two things that help companies in a tight economy: revenue preservation – find out if customers are going to leave, understand why, and save them; and revenue growth – understand if customers are having issues (“cries for help”) and delight them by acting on those issues. This generates amazing word of mouth and ultimately growth.
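
Craig’s point that term matching misses negation is easy to see in code. The following is a minimal sketch of the gap between dictionary lookup and even the crudest contextual handling; the lexicon and the negation-scope rule are toy illustrations of mine, not Attensity’s (or anyone’s) production NLP.

    # Contrast between pure term matching and negation-aware scoring.
    SENTIMENT = {"great": 1, "love": 1, "happy": 1,
                 "terrible": -1, "broken": -1, "unhappy": -1}
    NEGATORS = {"not", "never", "no", "isn't", "wasn't", "don't"}

    def term_match_score(text):
        """Dictionary lookup only: every matched term counts at face value."""
        return sum(SENTIMENT.get(w, 0) for w in text.lower().split())

    def negation_aware_score(text, scope=3):
        """Flip a term's polarity if a negator appears a few words before it."""
        words = text.lower().split()
        score = 0
        for i, w in enumerate(words):
            polarity = SENTIMENT.get(w, 0)
            if polarity and any(n in NEGATORS
                                for n in words[max(0, i - scope):i]):
                polarity = -polarity
            score += polarity
        return score

    review = "the screen is great but the battery is not great at all"
    print(term_match_score(review))      # 2 -- the negation is missed entirely
    print(negation_aware_score(review))  # 0 -- "not great" flips to negative

Full NLP systems go much further, parsing the sentence to attach sentiment to its target and recover the “why,” but even this toy rule shows where pure dictionary matching produces the false positives Craig warns about.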

Solutions Focus

Vendor prognosticators with a solutions focus include Keith Collins, CTO at SAS, and Manya Mayes, SAS Chief Text Mining Strategist. They see these challenges for 2009:

  • A broader set of vertical/horizontal offerings including more automated unstructured (text, voice, image) capabilities must be delivered for customer/product/competitive intelligence. SAS is doing this, for example, with SAS Text Miner integration into SAS Warranty Analysis. Automated capabilities include graphics, sentiment analysis, net promoter scores, and key performance indicators (KPIs) from text analysis results.

  • Solutions providers must take text analytics and search across the breadth of their offerings: For SAS, text+DI [data integration], text+BI [business intelligence], text+DataFlux, text+JMP, text+SOO [Service Operations Optimization], etc. SAS customers are requesting these capabilities, and we are building software and planning road maps accordingly.

  • Customers are consolidating software and need to settle on one vendor that can handle all approaches to text: text analytics/mining/business analytics/search/categorization.

  • Collaboration and mobile BI are critical needs, and vendors need to be agile and move technology swiftly to meet the needs of consumers.

Yves Schabes, President of Teragram, a SAS company, focuses his response on one particular challenge – increasing workers’ efficiency:

Given the current economic situation, large organizations are forced to keep up a fast pace of business with fewer resources, and it is therefore critical that these organizations have all of their content organized correctly so that employees can spend more time fulfilling their jobs and less time searching for information on their enterprise's system.

Neil Hartley, CEO of Leximancer, sees “an industry that has been around for a long time and yet has seen little in the way of being operationalized within business processes.” Neil continues:

Yes, there are proofs of concept, run by experienced vendor staff, but these [prototypes], in my experience, rarely get adopted by the business a) because of the setup and maintenance required, and b) because the source data vocabulary changes, making the initial setup redundant. Autonomy is the exception, but their deployments are largely search/retrieval-based and less qualitative.

What the business needs is a high degree of automation (without the need for excessive setup and maintenance) together with clarity in the analysis and the ability to apply control to the process where needed. This is exactly what Leximancer provides.

The other major trend I see is the need to make social media actionable, and this is something we’ve focused on heavily on our blog. I think customer attitudes on social media or microblogging sites are a leading indicator for the business. The business that waits for these trends to be reflected in their formal feedback programs may find it is too late to take effective action.

Technology and Applications

I had expected that a higher proportion of 2009 “challenges” responses would center on the technology and its applications; the responses I’ve already quoted touch on those areas, though with perhaps greater attention to business and market concerns. Eric Martin, Product Marketing Manager at SPSS, focused solely on solutions, which to me reflects both Eric’s confidence in his company’s market position and his own background: Eric earned a Ph.D. in immunology and cellular biology and, like many others, started using text mining on biomedical articles before joining text analytics pioneer LexiQuest, which was later acquired by SPSS. Challenge areas Eric sees for 2009 are:

  • Blog analysis: Already hot and will probably get bigger in 2009. There’s a lot of confusion, though, between blogs and other kinds of user-generated content or Web 2.0 data. It’s also very solution-oriented and requires not only text analytics but also data mining, Web scraping capabilities, etc.

  • Sentiment analysis: Not an option anymore in most projects. Part of Voice of the Customer, blog analysis applications, etc. High-quality out-of-the-box results are demanded more and more.

  • Automated translation: Also getting more and more market traction prior to text analysis.

Matthew Hurst of Microsoft Live Labs is likewise a techie at heart, and he’s also very interested in user-generated and other online content. Matthew says, “As reported elsewhere in research and industry literature, the majority of textual data being published online today is from the many genres of social media. To fully leverage this data, the key challenges are:

  • Comprehensive and complete data acquisition: Due to the social nature of this content, missing documents or authors is like missing the replies to answers or key voices in the choir.

  • Tuning or recreating standard tools to deal with less formal content: Tokenization, sentence segmentation, part-of-speech (POS) tagging, parsing, etc. all have different qualities and requirements in the social space. [A toy tokenization example follows the list.]

  • Development and upkeep of broad (product) ontologies: This is a key requirement for grounding any analytics.”
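
Matthew’s second point is concrete even at the tokenization step. A tokenizer tuned for edited prose shreds @mentions, hashtags, emoticons, and URLs; the pattern below is my own toy sketch of a social-media-aware alternative, not Microsoft’s (or any production) tokenizer.

    # Why standard tokenizers need retuning for social media: compare a
    # naive word pattern with one that knows the genre's conventions.
    import re

    # Order matters: match emoticons, @mentions, #hashtags, and URLs before
    # falling back to ordinary words.
    SOCIAL_TOKEN = re.compile(r"""
        [:;=][-o]?[)(DPp]      # emoticons such as :) ;-) =D
      | @\w+                   # @mentions
      | \#\w+                  # hashtags
      | https?://\S+           # URLs
      | \w+(?:'\w+)?           # ordinary words, including contractions
    """, re.VERBOSE)

    tweet = "@acme ur new phone is gr8 :) #winning http://bit.ly/x1 can't wait"

    print(re.findall(r"\w+", tweet))   # drops :) and splits the URL and "can't"
    print(SOCIAL_TOKEN.findall(tweet)) # keeps @acme, :), #winning, the URL, "can't"

Sentence segmentation and POS tagging need the same kind of retuning, since models trained on newswire assume capitalization and punctuation that social text rarely supplies.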

Lastly, Ren Mohan, Co-Chairman and CTO of IxReveal, replied not only with the three challenges his company hears most often from its clients, but also with a couple of “megatrends” he is beginning to see. Challenges are:

  • Speed to implementation: Clients remain concerned about implementation taking months before realizing any benefit. Immediate payback, more like a couple of weeks, seems to be the mantra!

  • Systems should dynamically and quickly react to changing new information: Clearly, clients do not want to go through an extensive process of rebuilding their analysis when requirements for analysis change. All our customers keep changing their analysis as new information unfolds.

  • Finding actionable intelligence in data that is not grammatical: Less time in call handling, texting, little patience, quick notes by data sources – all of this implies that an increasing amount of text data carries embedded meaning but is less grammatical. So, our clients are asking us to help them analyze such data as well.

Ren’s “megatrends” are:

  • The line between structured and unstructured is blurring and a new trend is emerging: Users want a new approach altogether. As opposed to converting unstructured to structured, users would like to see it the other way, and they want it quickly and easily. At the ISO [insurance industry] fraud/claims management conference, the keynote presenter asked how many wished their systems could respond with Google-like access and ease; 100% of the audience raised their hands. When he asked how many of their systems behave like that today, no hands went up! They are all using SQL. So, we see a new trend in analysis going from structured (SQL-like complexity) to unstructured (search-like simplicity)!

  • Thus far, text analytics has been available only for mega corporations! Now, the trend is catching on in the consumer/desktop user world as well. Products like uReka! from IxReveal and SearchWiki from Google are examples. They give users a way to store and reuse search queries and links, and personalize them. This is the beginning of analysis as users have to now think (analyze) about content databases, links, social networks, etc.

Ren’s conclusion: “These are truly exciting times!” Indeed.

A Live Challenge?

Here are thoughts on one more potential challenge for 2009, a live challenge, of a different variety.

As an aside while forwarding his take on 2009 text-analytics challenges, Lexalytics’ Jeff Catlin suggested a bake-off to be held at the 2009 Text Analytics Summit, which is slated for June 1-2 (preceded by tutorials) in Boston. Jeff says in his blog, “Our market is still dominated by too much flashy marketing and not enough down in the dirt numbers for ‘apples to apples’ comparisons.” Jeff is right, although I’ll add that I believe that most of the flashy marketing actually fronts very capable products. Perhaps we can get something going for this year – a commercially oriented version of the TREC (Text REtrieval Conference) challenge. Stay tuned, and do consider attending this year’s summit.

And if you’re a current or prospective text analytics user and would like to tell me about challenges you face or expect to face, please do get in touch by email at grimes@altaplana.com or by phone at (301) 270-0795.

Seth Grimes

Seth is a business intelligence and decision systems expert. He is principal consultant at Washington, D.C.-based Alta Plana Corporation, founding chair of the Text Analytics Summit, and chair of the Sentiment Analysis Symposium. Seth consults, writes, and speaks on information-systems strategy, data management and analysis systems, IT industry trends, and emerging analytical technologies.

Editor’s Note: More articles and resources are available in Seth’s BeyeNETWORK Expert Channel.
