Consolidating, Accessing and Analyzing Unstructured Data

Originally published December 12, 2005

It’s no secret that a tremendous amount of useful business information is locked away in unstructured documents and data files. The amount of this unstructured data, however, may surprise you. At the recent DCI Portals, Collaboration and Content Management conference in Miami, Zach Wahl from the Project Performance Corporation presented the following facts:   

  • 80 percent of business is conducted on unstructured information (Gartner Group).
  • 85 percent of all data stored is held in an unstructured format (Butler Group).
  • Unstructured data doubles every three months (Gartner Group).
  • 7 million web pages are added every day (Gartner Group).

These figures clearly demonstrate that a significant amount of valuable business information is encapsulated in unstructured data. Because of this, many organizations are realizing that consolidating, accessing and analyzing this unstructured data is an important factor in optimizing and analyzing business processes. Organizations can also use this data to gain a competitive advantage.

Unstructured data comes in many shapes and sizes. It may be stored in documents, reports, spreadsheets, web pages, or digital media (images, audio and video). The first step in processing this data is to document, consolidate and manage it. Although most database products can now handle unstructured data, the industry direction is to develop content management applications for managing it. These applications typically still use an underlying database system to store the data, but they extend the facilities offered by the database system with support for business metadata, versioning, templates, workflow, business-user-friendly interfaces, etc.

There is a wide range of content management applications on the market. Examples include enterprise document management, document imaging, electronic forms, web content management, digital media management, as well as email and instant messaging supervision and management. Legislative and legal requirements are also forcing companies to focus on managing records and archiving unstructured content.

The range of content management applications used in organizations often leads to the creation of many disparate content stores. According to a Forrester Research study, over 43 percent of organizations have more than six content stores.

Various techniques exist for providing a single interface to this multitude of content stores. One approach is the business portal, which offers a personalized interface to business content. Developing a business portal involves creating a business taxonomy that defines how information should be organized and accessed in an organization. Crawlers then scan and analyze the metadata of each content store and categorize the content based on the taxonomy. The results of the categorization process are stored in the portal directory, which business users use to locate and navigate information in the various content stores. The categorization process works more effectively when the source data has associated metadata that describes its business meaning.
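As a rough illustration of the categorization step described above, the following Python sketch matches document metadata against a taxonomy and records the results in a directory. Everything in it is an assumption made for illustration: the taxonomy, the store names, the document names and the simple keyword-overlap rule are hypothetical, and commercial portal products use far richer categorization techniques.

```python
from collections import defaultdict

# Hypothetical taxonomy: category -> keywords expected in document metadata.
TAXONOMY = {
    "Finance": {"invoice", "budget", "revenue"},
    "Human Resources": {"payroll", "benefits", "hiring"},
}


def categorize(metadata_keywords, taxonomy=TAXONOMY):
    """Return the sorted categories whose keywords overlap the metadata."""
    words = {w.lower() for w in metadata_keywords}
    return sorted(cat for cat, keys in taxonomy.items() if words & keys)


def build_directory(content_stores, taxonomy=TAXONOMY):
    """Scan each store's metadata and map category -> [(store, doc_id), ...]."""
    directory = defaultdict(list)
    for store_name, documents in content_stores.items():
        for doc_id, keywords in documents.items():
            for category in categorize(keywords, taxonomy):
                directory[category].append((store_name, doc_id))
    return dict(directory)


# Hypothetical content stores: store -> {document id -> metadata keywords}.
stores = {
    "doc_mgmt": {"q3-report.pdf": ["Revenue", "forecast"]},
    "web_cms": {"careers.html": ["hiring", "benefits"], "faq.html": ["support"]},
}
directory = build_directory(stores)
```

In this sketch, a document whose metadata matches no taxonomy keyword (here, faq.html) never reaches the directory, which reflects the point that categorization depends on metadata that describes business meaning.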

An enterprise search engine is another way to access unstructured content. In this case, the search engine analyzes the actual data in the various content stores and builds a search index for locating the data. Search engines are improving dramatically. In fact, many products now contain analytical capabilities that enable metadata to be extracted from the source content. This extracted information can be used not only to improve the search process, but also to create information for data warehousing and business intelligence applications. IBM, for example, has developed an architecture for plugging third-party analysis tools into a search engine. This architecture is called the Unstructured Information Management Architecture, or UIMA. IBM has stated that it intends to put UIMA into the public domain. Currently, most search engine analytical tools are focused on processing textual data.
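The idea of building a search index from the content itself can be sketched with a simple inverted index, which maps each term to the documents that contain it. This is a minimal Python sketch under stated assumptions: the document names and texts are hypothetical, the tokenizer is deliberately naive, and it does not represent how any particular search product (or UIMA) works.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Build an inverted index: term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return sorted(result)


# Hypothetical documents drawn from different content stores.
docs = {
    "memo-1": "Quarterly revenue exceeded the budget forecast.",
    "memo-2": "The hiring budget was approved.",
}
index = build_index(docs)
hits = search(index, "budget revenue")
```

Here the query "budget revenue" matches only memo-1, because the search requires every term to be present. A production engine would add ranking, stemming and access control on top of this basic structure.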

Search engine textual analysis and text mining tools provide the ability to extract metadata from unstructured content. This metadata enables organizations to understand the business meaning of the data. It also allows applications to relate unstructured content to associated structured data. Some enterprise information integration (EII) products (e.g., IBM WebSphere Information Integrator) allow queries to access a federated view of the related structured data and unstructured content. At the same time, ETL and data integration vendors (e.g., Informatica) are also beginning to support unstructured data. These vendors are extending their products to handle the extraction of unstructured content for loading into a data warehouse for analysis. Sound metadata is a key success factor for EII and ETL tools that process unstructured content.
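To make the idea of relating unstructured content to structured data more concrete, the Python sketch below extracts a customer identifier from free text and joins it to a structured customer table using the standard sqlite3 module. The table names, the identifier format (C- followed by four digits) and the sample documents are all hypothetical, and a regular expression stands in for a real text mining tool; it is not a model of how any EII or ETL product works.

```python
import re
import sqlite3

# Structured side: a hypothetical customer table in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id TEXT PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE doc_metadata (doc_id TEXT, customer_id TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("C-1001", "Acme Corp"), ("C-1002", "Globex")],
)

# Unstructured side: hypothetical free-text documents.
documents = {
    "email-17": "Complaint from account C-1001 about late delivery.",
    "note-04": "Renewal call scheduled with C-1002 and C-1001.",
}

# Assumed identifier format; a regular expression stands in for text mining.
ID_PATTERN = re.compile(r"\bC-\d{4}\b")
for doc_id, text in documents.items():
    for customer_id in sorted(set(ID_PATTERN.findall(text))):
        conn.execute(
            "INSERT INTO doc_metadata VALUES (?, ?)", (doc_id, customer_id)
        )

# One query over both sides: which documents mention which customers?
rows = conn.execute(
    """
    SELECT c.name, m.doc_id
    FROM customers AS c
    JOIN doc_metadata AS m ON c.customer_id = m.customer_id
    ORDER BY c.name, m.doc_id
    """
).fetchall()
```

The extracted identifiers act as the metadata that links the two worlds: once they are stored alongside the document ids, an ordinary SQL join can answer questions that span both structured records and unstructured content.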

Both EII and ETL vendors are also adding support for semi-structured data. An example of this is XML. Industries like finance and banking have already established XML vocabularies for data interchange. This makes extracting business meaning from the data easier.
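The following Python sketch suggests why an agreed XML vocabulary makes extraction easier: because element and attribute names carry the business meaning, a loader can turn the document into records without any text analysis. The vocabulary shown (payments, payment, payer, amount) is invented for illustration and is not an actual industry standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical payment vocabulary, invented for this example.
PAYMENT_XML = """
<payments>
  <payment id="p1">
    <payer>Acme Corp</payer>
    <amount currency="USD">1250.00</amount>
  </payment>
  <payment id="p2">
    <payer>Globex</payer>
    <amount currency="EUR">980.50</amount>
  </payment>
</payments>
"""


def extract_payments(xml_text):
    """Turn each <payment> element into a flat record ready for loading."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for payment in root.findall("payment"):
        amount = payment.find("amount")
        records.append(
            {
                "id": payment.get("id"),
                "payer": payment.findtext("payer"),
                "amount": float(amount.text),
                "currency": amount.get("currency"),
            }
        )
    return records


records = extract_payments(PAYMENT_XML)
```

Each resulting record is a plain dictionary that could be written to a staging table, which is the kind of step an ETL tool would perform before loading a data warehouse.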

The technologies and products used to process unstructured and semi-structured data depend on whether an organization simply wants to consolidate its data for easy access, or analyze it for business intelligence processing. The latter capability illustrates the true value of processing unstructured data. Regardless, the marketplace for handling unstructured and semi-structured data has a very bright future. Clearly, companies must develop strategies for handling such content.

  • Colin White

    Colin White is the founder of BI Research and president of DataBase Associates Inc. As an analyst, educator and writer, he is well known for his in-depth knowledge of data management, information integration, and business intelligence technologies and how they can be used for building the smart and agile business. With many years of IT experience, he has consulted for dozens of companies throughout the world and is a frequent speaker at leading IT events. Colin has written numerous articles and papers on deploying new and evolving information technologies for business benefit and is a regular contributor to several leading print- and web-based industry journals. For ten years he was the conference chair of the Shared Insights Portals, Content Management, and Collaboration conference. He was also the conference director of the DB/EXPO trade show and conference.

    Editor's Note: More articles and resources are available in Colin's BeyeNETWORK Expert Channel. Be sure to visit today!
