Unstructured Data Processing – Why Textual ETL?
Originally published November 21, 2011
Until the last decade, organizations relied on legacy systems, enterprise applications and market data gathered by analysts to make decisions for the business. To make any and all detailed, operational, up-to-the-second decisions, the systems that were in place worked just fine. To take care of any and all detailed analysis and reporting, the data warehouse and data marts were implemented.
Structured ETL
Structured extract, transform and load (ETL) is used to transform data from corporate and legacy applications so that the data – once transformed into a uniform, corporate structure – can be examined and analyzed consistently. Structured ETL addresses data integration – transformation, encoding, formatting, DBMS conversions, dimensions of attributes and more.
An example of ETL processing is as follows: Data representing gender is encoded in the input data in the form of (male/female), (m/f), (x/y), and (1/0) from different applications across the enterprise. Once processed, the output for gender is converted and specified simply as (m/f). Another example is dimensions of data attributes that are found in the legacy or applications environment. The dimensions will include lengths that are measured by (inches), (centimeters), or (feet). As output of ETL, data is converted and length is measured uniformly (for example, in centimeters).
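The two conversions described above can be sketched in a few lines of Python. This is only an illustrative sketch: the source names (app_a to app_d), the record layout, and the assumption that the first value in each pair means male are hypothetical and would differ in a real environment.

```python
# Hypothetical per-application encodings for gender, mapped to the
# corporate standard (m/f). Which code means "male" is an assumption here.
GENDER_MAPS = {
    "app_a": {"male": "m", "female": "f"},
    "app_b": {"m": "m", "f": "f"},
    "app_c": {"x": "m", "y": "f"},
    "app_d": {"1": "m", "0": "f"},
}

# Conversion factors to the uniform output unit (centimeters).
CM_PER_UNIT = {"inches": 2.54, "centimeters": 1.0, "feet": 30.48}


def normalize_gender(source, value):
    """Translate a source-specific gender code into m/f."""
    return GENDER_MAPS[source][str(value).strip().lower()]


def to_centimeters(value, unit):
    """Convert a length in inches, centimeters or feet to centimeters."""
    return round(value * CM_PER_UNIT[unit], 2)


def transform(record):
    """Apply both conversions to one input record."""
    return {
        "gender": normalize_gender(record["source"], record["gender"]),
        "length_cm": to_centimeters(record["length"], record["unit"]),
    }


if __name__ == "__main__":
    sample = {"source": "app_d", "gender": 1, "length": 10, "unit": "inches"}
    print(transform(sample))
```

Keying the gender map by source application, rather than using one global lookup, avoids collisions when two applications use the same code with different meanings.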
Enter Unstructured Data
Nearly all legacy data is structured. Structured data is repetitive and is defined by attributes and keys that recur over and over. But not all data is structured. There is unstructured data as well. Much textual, unstructured data is found in the corporation. In fact, it is estimated that 80% or more of the data in the corporation is in the form of unstructured text.
Textual data comes in many forms and from many places. Forms of textual data include email of different types; corporate contracts with multiple vendors, employees, customers and more; human resource files; medical records; financial reports; and corporate memos.
How will you read any or all of this data in a given circumstance? Trying to read and analyze textual data without first integrating the text is simply an exercise in futility. There are many reasons why raw text must be integrated before it is useful for analysis. While standardization of data is wonderful in the structured world, as you start looking into the unstructured world, you will quickly realize the challenges that exist in standardizing that data.
Technology advances in the last five years have given us platforms such as Hadoop, NoSQL, MapReduce and Ruby. These platforms have been engineered to address needs that existing infrastructure struggles with, such as elastic scalability, compute on demand, self-tuning and redundancy. The platforms have created a very robust infrastructure for handling Internet workload demands and paved the way for Facebook, MySpace, Twitter, Groupon and many other new business ventures that create and process large volumes of data on a daily or hourly basis. One can argue that, using the MapReduce platform, we can solve the unstructured data integration problem. While this is a true statement, it brings with it enormous problems, including:
The reason for this sentiment is that processing any kind of data in the enterprise is a process that is defined and owned by the business, as they own the lifecycle of data for the enterprise. When it comes to processing unstructured data, the only people in any enterprise who can own and define the rules for this unstructured data are the business users. But business users cannot write ETL or Hadoop code. This is where you will need textual ETL.
Textual ETL, as the name suggests, is a processing technique to solve the problems of unstructured data processing; but unlike other software or rules engines, it is a multi-step process that guides a business user to define the rules for processing any form of unstructured data. Let me explain this with an example of an emerging rules engine, Forest Rim Textual ETL™.
Toxic chemicals – Toxic chemicals can affect you anywhere and any day. Nobody can accurately anticipate and prepare for toxic chemical attacks. Imagine a cloud-based app that can provide vital information on basic toxins and antidotes as well as potential combinations of toxins and their antidotes. Such a thing is possible when you use textual ETL. You can process all types of text on toxins, including images and videos. Then with enriched metadata and the availability of a taxonomy, this app can be run from your smartphone or tablet anywhere in the world, providing potentially life-saving information.
When you want to create this app, you need a few things:
In summary, you need to be able to create a Google-like behavior, but highly subject oriented, integrated, time variant and non-volatile.
This is where a product like Textual ETL™ is useful. It allows you to define your business rules in plain English and process your documents through the engine. If you add more rules as you discover insights, you can reprocess documents any number of times. The engine has a built-in machine learning capability that will capture rules and enable you to process data over and over. The output from the engine is a highly usable metadata-based set of information that is ready for consumption with the associated contexts. On top of this, you can simply add a search appliance and you are ready to start exploration. This is something that you cannot get done in a small time frame if you take the usual route of coding. With a textual ETL product, you will have satisfied all the conditions, and yet have a very flexible and scalable architecture.
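To make the idea of rule-driven, repeatable processing concrete, here is a minimal Python sketch. It is not how Textual ETL itself works internally; the function name apply_rules, the rule format (a term mapped to a category), the sample text and the fixed-width context window are all assumptions chosen for illustration.

```python
import re


def apply_rules(doc_id, text, rules, window=30):
    """Return one metadata row per rule match, with surrounding context.

    `rules` maps a term to a category. This is a hypothetical rule format;
    a real product would offer far richer rule types.
    """
    rows = []
    for term, category in rules.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            start = max(0, match.start() - window)
            end = min(len(text), match.end() + window)
            rows.append({
                "doc": doc_id,
                "term": term,
                "category": category,
                "offset": match.start(),
                "context": text[start:end],
            })
    # Order rows by where the match occurs in the document.
    return sorted(rows, key=lambda row: row["offset"])


# Invented sample text, for illustration only.
doc = "Report: chlorine leak in building 4; staff received oxygen therapy."

# First pass with a single business rule.
rules = {"chlorine": "toxin"}
first_pass = apply_rules("memo-1", doc, rules)

# A new rule is discovered, so the same document is reprocessed.
rules["oxygen therapy"] = "treatment"
second_pass = apply_rules("memo-1", doc, rules)

if __name__ == "__main__":
    for row in second_pass:
        print(row["category"], "|", row["term"], "|", row["context"])
```

Because the rules are kept as data rather than code, adding one and rerunning the same documents requires no programming change, which is the property the paragraph above attributes to a rules-based approach.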
This is why you need textual ETL processing, and this is where the success and failure of unstructured data or big data processing happens. With this approach, you enable the business user to be the owner of creating the business rules to interrogate this data and process it multiple times for each rule condition and context, which will resolve content disambiguation. This process will also resolve the ownership question of whether IT or business is the responsible owner for unstructured data processing in the enterprise.
Remember that processing unstructured data is a solution architecture that will use a myriad of technologies, but a product like Textual ETL™ will simplify the processing portion of that solution architecture and will put the power of defining the business rules for data interrogation in the hands of the business users.
Recent articles by Krish Krishnan