Enterprise Search or Text Analysis: Approaches for Unstructured Data Integration
Originally published June 27, 2012
A great debate is raging in the industry, fanned by the adoption of "big data." The question is simple: Do we build better search techniques, or do we go all the way to text analysis to integrate unstructured data? The simple answer is “yes” to both, but that answer hides layers of complexity, which this article attempts to explain.
Search vs. Analysis
At a fundamental level, both search and analysis engines operate on text data. That is where the similarity ends. With search, you typically look for patterns and present the findings to the user in short order; no further transformation is applied to the text. Analysis also involves discovering patterns (akin to search) but, more importantly, applies transformations to the text to create a meaningful outcome. Analysis assumes that text must be integrated and transformed before it can be analyzed. This advanced treatment of text is where complexities arise, and the field, though it has decades of algorithms, research and development, and published theses behind it, remains nascent and niche.
The fundamental characteristic of text is best captured by one adjective: “erose” (not to be confused with “verbose”). The word, from the Latin “erosus,” means “irregularly notched, toothed, or indented” (from dictionary.com) and is used mostly in botany to describe the leaves of a plant. It fits because text is long, complex and unpredictable. Text combines words and phrases into contextual statements, which may contain repeatable patterns, although that repeatability can itself vary with context within a single document. When discussing “unstructured” data, we use this lack of repeatability and the associated ambiguity to distinguish text data analysis and its outcomes from structured data, where great repeatability and a structured, formatted storage architecture lend themselves well to integration and analytics.
Applying Search for Unstructured Integration
Given the available search infrastructure and algorithms, one can argue that integrating “unstructured data” only requires extending search outputs. Why create a separate text analysis platform? There have been attempts to do exactly that, but folding integration and transformation into search is not a good approach.
Text Analysis
Let’s look at how analysis differs from search:
Text analysis advances the integration of unstructured data beyond just light indexing and pattern matching of search.
Analysis consists of multiple transformation steps, each of which needs to be run once per set of patterns, metadata terms or context.
Analysis creates multiple iterations of metadata output, as opposed to simple result sets of entire pages; this metadata forms a powerful set of indexes into the text and its context.
Analysis always processes data in a consistent manner, unlike search.
Consider a popular example from the Wikipedia article on natural language processing:
The sentence "I never said she stole my money" demonstrates the importance stress can play in a sentence, and thus the inherent difficulty a natural language processor can have in parsing it.
Depending on which word the speaker stresses, you can see how this sentence could have several different meanings.
If you search for this pattern, you will get every statement that matches, and you then have to work out the intended meaning of each one yourself. If you process the same text through a text analysis platform, you can create a context-oriented result set that gives you not only the result but also the associated context, which is far more useful.
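To make the contrast concrete, here is a minimal Python sketch. It is an illustration only, not how any particular search engine or text analysis product works: the `search` and `analyze` functions, the `Hit` record and the sample documents are all hypothetical names invented for this example. Search returns the matching documents untouched, while analysis emits one metadata record per occurrence along with its surrounding words.

```python
import re
from dataclasses import dataclass


@dataclass
class Hit:
    """One metadata record produced by analysis (hypothetical structure)."""
    doc_id: str
    term: str
    offset: int      # word position of the term inside the document
    context: str     # the term plus a few neighboring words


def search(docs, pattern):
    """Search: return the ids of documents containing the pattern, nothing more."""
    rx = re.compile(re.escape(pattern), re.IGNORECASE)
    return [doc_id for doc_id, text in docs.items() if rx.search(text)]


def analyze(docs, terms, window=3):
    """Analysis: emit one record per term occurrence, with nearby words as context."""
    hits = []
    for doc_id, text in docs.items():
        words = text.split()
        for i, word in enumerate(words):
            normalized = word.strip('.,;:"!?').lower()
            if normalized in terms:
                start = max(0, i - window)
                context = " ".join(words[start:i + window + 1])
                hits.append(Hit(doc_id, normalized, i, context))
    return hits


# Sample data invented for this sketch.
DOCS = {
    "memo-1": "I never said she stole my money.",
    "memo-2": "The auditor said the money was stolen from the vault.",
}

matching_ids = search(DOCS, "money")
records = analyze(DOCS, {"money"}, window=2)
```

The search result is just a list of document ids, so interpretation is left to the reader. The analysis result is a set of rows that could be stored and queried later. Real analysis would of course go well beyond a fixed word window, but the shape of the output is the point here.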
The need to transform data before it becomes useful for analytics and reporting is not a new thought. We have always designed the data warehouse to process data in this fashion, and we call it ETL (extract, transform and load). Extending this approach to text creates a powerful concept: textual ETL.
This need for transformation and integration of text has some interesting challenges. One challenge is the size of the data to be transformed. Let us assume that you intend to take the Internet as your data set. Is it possible to transform and analyze all the text found on the Internet? In a nutshell, it is not practical or feasible. In such a situation, you primarily rely on search and can use a subset of data from the result set for deeper analysis.
But there are other data sets, such as enterprise data, that are large in volume, complex in format and have multiple contexts, yet lend themselves to the rigors of text analysis and processing. A simple example is the contracts that exist across business divisions such as purchasing, supply chain, inventory management, logistics, transportation and human resources. Each of these contracts has a different purpose, and there may be many contracts of a given type that can provide insights beyond just start and end dates. Those insights include legal terms and conditions with applied context, liabilities, obligations and much more. After analysis, such text yields a rich metadata output with context that can be easily integrated into a decision-support ecosystem.
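As a rough illustration of textual ETL applied to contracts, the Python sketch below extracts obligation statements from contract text, transforms them into rows and loads them into a relational table. Everything in it is an assumption made for the example: the sample contract sentences, the "shall" pattern used to spot obligations, and the `obligation` table layout. A production platform would need far richer linguistic and contextual rules.

```python
import re
import sqlite3

# Sample contract snippets, invented for this sketch: (division, text).
CONTRACTS = [
    ("purchasing", "Supplier shall deliver goods by 2012-09-30. Buyer shall pay within 30 days."),
    ("logistics", "Carrier shall insure all shipments. This agreement ends on 2013-01-15."),
]

DATE_RX = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
# Naive assumption: an obligation reads "<Party> shall <duty>."
OBLIGATION_RX = re.compile(r"\b([A-Z][a-z]+) shall ([^.]+)\.")


def extract(contracts):
    """Extract: yield raw text together with its business context (the division)."""
    for division, text in contracts:
        yield division, text


def transform(division, text):
    """Transform: turn free text into (division, party, duty, due_date) rows."""
    rows = []
    for match in OBLIGATION_RX.finditer(text):
        party, duty = match.group(1), match.group(2)
        dates = DATE_RX.findall(duty)
        rows.append((division, party, duty, dates[0] if dates else None))
    return rows


def load(conn, rows):
    """Load: store the rows where reporting tools can query them."""
    conn.executemany("INSERT INTO obligation VALUES (?, ?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE obligation (division TEXT, party TEXT, duty TEXT, due_date TEXT)"
)
for division, text in extract(CONTRACTS):
    load(conn, transform(division, text))

dated = conn.execute(
    "SELECT party, due_date FROM obligation WHERE due_date IS NOT NULL"
).fetchall()
total = conn.execute("SELECT COUNT(*) FROM obligation").fetchone()[0]
```

Once the obligations sit in a table, they can be joined to structured data such as vendor or division records, which is the kind of integration plain search results do not offer.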
Other challenges include the variety of formats, the volume of text, the ambiguous nature of the data itself and the lack of formal documentation, to name a few. But once these challenges are addressed, the output of such an analysis is powerful enough to support an extensive visualization platform for looking into text and unstructured data within the enterprise. This is where you can leverage data that has sat on content management platforms for years to produce useful output on trends and behaviors.
The major differences between a result set produced by a search and text analytics system are as follows:
Text Analysis
Based on the discussion here, you can discern that search is good for finding things on an ad hoc basis in a large set of data. Analysis is good for creating a platform that can be used repeatedly against a large but finite amount of textual data as related to a corporation.
In order to perform text analysis and deep text mining, you need to process the text rather than extend a search engine or appliance. A robust text analysis system will provide for the following:
The major advantage of text analysis is the ability to track changes as they occur within the text environment, in much the same way that changes are tracked in a dimension. This capability, called document mid-point reprocessing, is the most powerful output of analysis and is what makes it a much better proposition than search. You can easily extend the concept to emails, Excel spreadsheets and other document types.
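The article does not describe how document mid-point reprocessing is implemented. As one possible reading of the dimension analogy, the Python sketch below versions document text the way a type 2 slowly changing dimension versions an attribute: when the text changes, the current record is closed and a new one is opened. The `DocVersion` record, the `record_version` function and the integer timestamps are hypothetical and exist only for this illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class DocVersion:
    """One version of a document's text (hypothetical structure)."""
    doc_id: str
    digest: str
    text: str
    valid_from: int
    valid_to: Optional[int] = None   # None marks the current version


def record_version(history, doc_id, text, as_of):
    """Append a new version only if the text changed; return True when it did."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    current = next(
        (v for v in history if v.doc_id == doc_id and v.valid_to is None),
        None,
    )
    if current is not None:
        if current.digest == digest:
            return False             # unchanged, nothing to reprocess
        current.valid_to = as_of     # close the superseded version
    history.append(DocVersion(doc_id, digest, text, as_of))
    return True


history = []
first = record_version(history, "contract-1", "Term: 12 months", as_of=1)
repeat = record_version(history, "contract-1", "Term: 12 months", as_of=2)
changed = record_version(history, "contract-1", "Term: 24 months", as_of=3)
```

Only the calls that return True would trigger a fresh pass of analysis, and the closed versions remain available for asking what a document said at an earlier point in time.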
In conclusion, search and text analysis serve different purposes in processing unstructured data, and both can be leveraged effectively. Search can be used for early-stage data discovery, and text analysis can be used for detailed analysis and downstream analytical processing. But remember this: Do not use search as a substitute for traditional text analytics.
Recent articles by Krish Krishnan