Originally published January 27, 2009
Open source is a great choice for many text analytics users, especially folks who have programming skills, who need custom capabilities or who are trying to get a feel for possibilities before committing themselves. Excellent options are available for all these users. Tools such as Gate, NLTK, R and RapidMiner share the low cost, power, flexibility and community that have driven adoption of open-source software by individual users and enterprises alike. RapidMiner even combines text processing with business intelligence (BI) and visualization functions.
This article will look at open source text analytics, focusing on those four tools. (UIMA, the open source Unstructured Information Management Architecture, is a rich topic in itself, one that merits its own article.) I will suggest a number of resources that will help you get started. Keep in mind that since these tools are open source, you can simply download them and try them out!
Be warned, however, that just as in other IT domains, open source text technologies are not for everyone. Open source tools have their strengths, but organizations that are looking for polished user interfaces, for the responsiveness expected by paying customers of commercial vendors and for adaptation to particular business problems may prefer one of the many attractive, closed source alternatives on the market.
Lastly, hosted “as a service” options are very popular among new corporate users, but there are no significant, open source-based SaaS text analytics offerings available. Because it's free, open-source software does lower the baseline costs for companies that offer value-added extensions and hosted options, but in any case subscribers to text analytics as a service (TAaaS) lose the freedom to customize and extend the tools.
If you’re a Java, Python, Perl or R programmer – if you’re looking to build text analytics into a homegrown system or if you have a stable of programmers at your beck and call – open source is the way to go. Same if you're a tech-savvy data miner or data analyst, especially if you like to dig into the algorithms and if you don't have the funds for commercial software or services.
Look first at Gate, the General Architecture for Text Engineering, which is free for download from the project sponsor, the University of Sheffield in the UK, or sourceforge.net. Gate comes as a workbench-style GUI and a set of Java classes. I use it myself, and I've been in touch with users at organizations including LinkedIn and information service giant Thomson.
Figure 1: Gate provides off-the-shelf annotation and information-extraction capabilities via a workbench interface
Gate is an ace at information extraction (IE). (For a glossary of terminology, I'll refer you to part 1 of my BeyeNETWORK article, Text Analytics Basics.) It comes with a default natural-language processing (NLP) annotation pipeline called ANNIE. (Annotation, also known as text augmentation, is when the software recognizes and applies XML tags to features such as names, geographic locations, phone numbers, etc.) Gate has a plug-in architecture; a slew of extensions are freely available. Plug-ins adapt Gate for selected international languages; provide interfaces for the Google and Yahoo search-engine APIs and information retrieval (IR) tasks; support training, application and evaluation of machine-learning models; and more.
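To make the idea of annotation concrete, here is a minimal Python sketch that wraps recognized features in XML-style tags. It is not Gate or ANNIE code: the two patterns, the tag names and the annotate function are my own assumptions for illustration, and a real pipeline relies on gazetteers, grammars and trained models rather than a pair of regular expressions.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical feature patterns, chosen only for illustration.
PATTERNS = [
    ("Phone", re.compile(r"\b\d{3}-\d{3}-\d{4}\b")),
    ("Email", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
]

def annotate(text):
    """Wrap each recognized feature in an XML-style tag named for its type."""
    # Collect (start, end, label) spans for every pattern.
    spans = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    spans.sort()

    pieces = []
    pos = 0
    for start, end, label in spans:
        if start < pos:
            continue  # skip a span that overlaps one already tagged
        pieces.append(escape(text[pos:start]))
        pieces.append("<%s>%s</%s>" % (label, escape(text[start:end]), label))
        pos = end
    pieces.append(escape(text[pos:]))
    return "".join(pieces)

print(annotate("Call 555-123-4567 or write to info@example.org."))
```

The untagged text is XML-escaped so that the output stays well-formed even when the input contains characters such as < or &.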
Gate is distributed under the GNU Library General Public License (LGPL) version 2. You're free to use the code for commercial purposes and embed it in commercial software. LGPL is a copyleft license: if you alter the code and distribute the result, you have to release your modified version under the LGPL as well.
Are you interested in some simple steps that will get you started? Download the software and then check out instructions provided by Gate user Trevor Stone. Also, you’ll find some very helpful recorded tutorials at the Sheffield site plus extensive documentation, although frankly some of it is out of date. I’ve found Manu Konchady’s book, Building Search Applications: Lucene, LingPipe, and Gate, to be a clearly written, helpful, step-by-step guide to applying Gate, in conjunction with Apache Lucene open-source search software and Alias-I’s LingPipe pseudo-open source, Java NLP, clustering and classification toolkit.
Clustering and classification are examples of data mining functions. Clustering discovers ways in which terms or documents are similar and may be grouped; classification bins terms or documents into categories, which may be discovered clusters or may have been defined in some other way. In the text context, we may generate clusters representing topics or themes discovered in a training set of documents and then classify each new document into the cluster category it best matches. Data mining also seeks to discover links and association rules.
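The classification half of that description can be sketched in a few lines of plain Python. This is one simple technique among many (a nearest-centroid classifier over word counts, compared by cosine similarity), not the method used by any tool named in this article. The training sentences and category names are invented for the example, and the categories here are predefined rather than discovered by clustering.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy training set, invented for illustration: category -> example documents.
TRAINING = {
    "finance": ["stocks fell as the bank raised interest rates",
                "the bank reported strong quarterly earnings"],
    "sports": ["the team won the match in extra time",
               "the striker scored twice for the home team"],
}

# One summed term vector (an unnormalized centroid) per category.
CENTROIDS = {}
for category, docs in TRAINING.items():
    total = Counter()
    for doc in docs:
        total.update(bag_of_words(doc))
    CENTROIDS[category] = total

def classify(text):
    """Return the category whose centroid is most similar to the text."""
    vector = bag_of_words(text)
    return max(CENTROIDS, key=lambda c: cosine(vector, CENTROIDS[c]))

print(classify("interest rates and bank earnings"))
print(classify("the home team scored"))
```

Raw word counts let common words such as "the" influence the match; real systems usually remove stop words or reweight terms before comparing documents.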
If you have a data mining background, RapidMiner and R are strong text analytics options.
R is an open-source implementation of the S statistical programming language, which was developed at Bell Laboratories starting in the mid-’70s. R is available under the GNU General Public License, which allows commercial use. Look in particular for tm, the R Text Mining Package, and for other useful modules and software interfaces listed under the natural-language processing task view. The paper “Text Mining Infrastructure in R” will help you along.
RapidMiner is commercial open source. It is available in a free community edition under the GNU Affero General Public License (AGPL), which is based on the GPL with an added provision for software used over a network, and also under a closed-source (commercial) license for those who wish to embed the software into proprietary, commercial products. RapidMiner was developed at the University of Dortmund, Germany, and was formerly known as YALE (Yet Another Learning Environment). The university spun off Rapid-I in 2007 to develop and support the software.
The software supports a wide variety of clustering, classification and other data mining functions. Text-related modules include the Word Vector Tool (WVT), Named Entity Recognition (NER) and Data Stream plug-ins. According to Rapid-I cofounder and managing director Ralf Klinkenberg, RapidMiner responds to a very wide set of text-processing needs including news filtering, email routing, sentiment analysis and opinion mining, and general information extraction. According to Klinkenberg, RapidMiner allows users to combine unstructured data (text documents) and structured data (database tables, time series data, audio data, etc.).
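A word vector, in the general sense, represents a document as a set of term weights that data mining functions can then treat like any other structured data. The Python sketch below uses TF-IDF, one common weighting scheme, purely as an illustration. It does not use or reproduce the API of the Word Vector Tool, and the function names are my own.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_vectors(documents):
    """Return one {term: weight} dict per document.

    weight = (term count / document length) * log(N / document frequency),
    where N is the number of documents.
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        length = len(tokens) or 1  # guard against empty documents
        vectors.append({
            term: (count / length) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

vectors = tfidf_vectors(["spam offer now", "meeting agenda now"])
for vector in vectors:
    print(sorted(vector.items()))
```

A term that appears in every document, such as "now" in this toy input, gets a weight of zero, which is the point of the inverse-document-frequency factor: it discounts terms that do not help tell documents apart.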
Who should look into RapidMiner? Ralf Klinkenberg says,
“Rapid-I provides an introductory RapidMiner video on its page, a free 500-page RapidMiner tutorial (PDF), a free RapidMiner Text Mining plug-in tutorial and an online tutorial within RapidMiner with more than 30 example data mining processes that can be interactively modified and applied including extensive explanations of these processes. RapidMiner comes with many more example data and text mining processes. Hence, RapidMiner is a tool not only for data mining professionals, but also for beginners, students, researchers and people who just want to get started with data mining to better solve their business problems.”
You can include Gate and RapidMiner Java classes in your applications, but if you’re a Java programmer (or you supervise one), you may also wish to look into the OpenNLP site. Really, OpenNLP is a collection of disparate software components, libraries and programs. The common thread is that the listed projects are all open source and they all implement some form of natural language processing or an associated function such as machine learning.
If you program the Python open source scripting language – excuse me, the Python “dynamic object-oriented programming language” (I wrote my first Python program back in 1996 when it was a lot less functional and still use it for ETL programming) – skip OpenNLP and jump right to NLTK, the Natural Language Toolkit, a set of “open-source Python modules, linguistic data, and documentation for research and development in natural language processing, supporting dozens of NLP tasks.” A Creative Commons licensed e-book, available online in PDF and HTML formats and containing plenty of examples and sample code, will help you along immensely.
If you’re a Perl programmer, switch to Python. It’s not too late. But if you insist on sticking with Perl, consult Practical Text Mining with Perl by Roger Bilisoly or Manu Konchady’s Text Mining Application Programming.
There isn’t space to detail all the open source text tools and toolkits that are out there. Many of them are specialized, focusing on particular functions rather than providing a comprehensive feature set. Many are the product of academic research and fall short of enterprise usability and code quality. The tools I have described – Gate, RapidMiner, NLTK, R and the OpenNLP collection – are starting points. They cover a spectrum of text analytics tasks and will satisfy a spectrum of users. Some users will try them to get an introduction to text analytics and then move on to closed source, commercial tools. They'll be perfect for other users, those who are technically adept or need code they can link into custom applications.
The variety of available text technologies, both open and closed source, is a testament to the vibrancy and vitality of the field. With so many choices, you're likely to find software that meets your needs. If you don't, chances are that one of the open source options will provide a platform for you to build out a system that does just what you need.
Recent articles by Seth Grimes