
Blog: Dan E. Linstedt


Hidden in the unstructured information...

Welcome again. Unstructured data is a hard thing to grasp, let alone to process; but if we (businesses) are going after it, then we MUST have a reason. That reason? There must be value in the information hidden in the unstructured layers. After all, what is "unstructured" data anyway? I think free-form text is still semi-structured, and so are images, emails, Word docs, and other such elements. They are all structured to some degree; otherwise programs would not be able to display the documents, search the images, allow alterations, or perform matches. I think what we in the Data Warehousing / Data Integration industry should be focusing on is how best to leverage the programs and algorithms already built for "unstructured information".

Think about it: for images there are all kinds of image processing programs that handle matching, alteration, consolidation, overlay, resizing, colorization, and so on. For drawings, there are CAD programs, and element tags at the end or in the middle of the image file that explain all the components. For chemical images there are sets of commands and tags that explain how to build a rotating 3D visual of the chemical elements and their associated parts. For Word docs and other documents there are "parsing and processing programs" like Microsoft Word, KDE KOffice (open source), StarOffice, and so on. For emails, there are many different programs - but most email traffic can actually be "sniffed" off TCP/IP packets without much damage to the content (if any today).

Given all this, the question I truly have is: WHAT IS unstructured data? I'm not so sure it's such a good term to use, but let's just accept (for the purposes of this entry) that unstructured data is everything that isn't (easily) defined by a standard RDBMS table structure - without BLOBs and CLOBs, of course. Let's pretend for a minute that everything stored in a BLOB or CLOB is considered "unstructured", and then return to the question above.

Ten years ago (or more) I worked as an employee for a government manufacturing corporation: big money, big contracts, compliance, and unstructured data. Our manufacturing plan was filled with unstructured data. At that time we needed (as part of our effort) to integrate parts drawings, and to look for text within the CAD drawings (in other words, the annotations for specific parts drawings) to figure out what impact it had on the plan. Back then, the CAD images were just that, CAD images, and picking the text out wasn't as simple as "looking for the text attached to the image". We literally had to process vector graphic commands.

Why? What was hidden in our unstructured information?
In our case, instructions and plan estimations. The company was going through SEI/CMM, lean initiatives, an SAP implementation, business process re-engineering, compliance, and so on. They were trying to improve the efficiency of the planners and ensure the right image was attached to the right descriptive paragraphs explaining the build process. There was (and still is) inherent value to the business in processing the unstructured information.

Why is this important for us?
Because unstructured information processing is hot now in the commercial world. There's value hidden in these documents, and we need to understand (as a corporation) where that value exists, and how it can impact our business. Bill Inmon shows a wonderful demonstration of finding "gas-pipeline" problems by producing topographical maps, or manufactured landscapes, based on word association and frequency, from scanning unstructured (semi-structured) documents across the organization. Improving communication and spotting problems before they occur is a huge benefit.
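
To make "word association and frequency" a little more concrete, here is a minimal Python sketch. It is not the method behind the demonstration mentioned above; it only counts how often words appear and how often pairs of words show up in the same document. The sample documents, the tiny stop-word list, and the function names are all assumptions made up for illustration.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical stop-word list; a real pipeline would use a far larger one.
STOP_WORDS = {"the", "a", "an", "in", "on", "of", "and", "at", "was", "is", "near", "to"}


def tokenize(text):
    """Lowercase the text and return its alphabetic words, minus stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def term_frequencies(documents):
    """Count how often each word appears across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts


def cooccurrences(documents):
    """Count in how many documents each unordered word pair appears together."""
    pairs = Counter()
    for doc in documents:
        unique_words = sorted(set(tokenize(doc)))
        pairs.update(combinations(unique_words, 2))
    return pairs


# Made-up sample documents standing in for reports scanned across an organization.
documents = [
    "Corrosion reported on the pipeline near valve 12",
    "Valve 12 inspection found corrosion and a leak",
    "Pipeline pressure was normal at the compressor station",
]

freq = term_frequencies(documents)
pairs = cooccurrences(documents)

print(freq.most_common(3))
print(pairs[("corrosion", "valve")])
```

Pairs that co-occur unusually often (here "corrosion" and "valve") are the kind of signal a word-association map would surface for a person to investigate.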

So how do we access this information easily today?
Well, if you're like me and you don't want to actually launch Word, Excel, or graphics editing programs in order to "scan the screen to capture content", then you'll want to investigate the use of an EII (Enterprise Information Integration) tool. EII tools bring with them the ability to process unstructured and semi-structured information through SQL queries, XQueries, and other potential mechanisms.

What you do with the information after you have discovered it should be predetermined by the business case, that is, the reason for purchasing and installing EII in the first place. As usual, the business needs to drive IT to solve the problem of accessibility. Establish the value of "finding" and "using" the data in the unstructured world before you set out to implement.

What are some of EII's strengths today?
* Ability to access XML documents
* Ability to access Word docs, Excel, PowerPoint
* Ability to access emails
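
As a rough illustration of what "access" means for two of these sources, the sketch below uses only the Python standard library to turn an XML document and a raw email into flat, table-like rows. This is not how any particular EII product works; the sample documents, the field names, and the helper functions xml_rows and email_row are hypothetical.

```python
import xml.etree.ElementTree as ET
from email import message_from_string

# Hypothetical sample documents, made up for illustration.
xml_doc = """<parts>
  <part id="P-100"><name>Bracket</name><note>Torque to spec</note></part>
  <part id="P-200"><name>Housing</name><note>Inspect weld</note></part>
</parts>"""

raw_email = (
    "From: planner@example.com\n"
    "Subject: Drawing P-100 revision\n"
    "\n"
    "Please attach the revised drawing to the build plan.\n"
)


def xml_rows(xml_text):
    """Flatten each <part> element into a dict, one 'row' per part."""
    root = ET.fromstring(xml_text)
    return [
        {
            "id": part.get("id"),
            "name": part.findtext("name"),
            "note": part.findtext("note"),
        }
        for part in root.findall("part")
    ]


def email_row(raw):
    """Parse a simple, non-multipart email into sender, subject, and body."""
    msg = message_from_string(raw)
    return {
        "sender": msg["From"],
        "subject": msg["Subject"],
        "body": msg.get_payload().strip(),
    }


rows = xml_rows(xml_doc)
mail = email_row(raw_email)

print(rows)
print(mail["subject"])
```

Once the content is in rows like these, it can be joined to ordinary warehouse tables, for example by matching the part id mentioned in the email subject.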

What are some of the features that EII will need in the near future?
* Ability to parse, access, and pull text from various image formats
* Ability to use image match and compare algorithms (widely available on the market), say for matching thumbprints and retina scan images.
* Ability to query CAD images, layered images, and process "statistics" about the images. Making use of the statistics about an image can be much more powerful than making use of the image itself. EII of the future will focus on providing high quality access to "summarization" of existing images in a standardized format.
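
To show what "statistics about an image" might look like in a standardized format, here is a small Python sketch. It assumes the image has already been decoded into rows of grayscale values from 0 to 255; the function name summarize_image, the choice of statistics, and the threshold of 128 for "dark" pixels are my own assumptions, not an existing standard.

```python
from statistics import mean


def summarize_image(pixels):
    """Summarize a grayscale image given as a list of rows of ints (0-255)."""
    flat = [value for row in pixels for value in row]
    return {
        "width": len(pixels[0]),
        "height": len(pixels),
        "min": min(flat),
        "max": max(flat),
        "mean": mean(flat),
        # Fraction of pixels darker than mid-gray (assumed threshold of 128).
        "dark_fraction": sum(value < 128 for value in flat) / len(flat),
    }


# A tiny made-up 4x2 image: dark on the left, bright on the right.
image = [
    [0, 0, 255, 255],
    [0, 64, 192, 255],
]

summary = summarize_image(image)
print(summary)
```

A summary like this is small enough to store in a regular table column set, so it can be queried and compared without ever loading the image itself.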

Remember, the summarization of unstructured and semi-structured information, and what is done with that summarization, can often shed light on "how" these documents are utilized or whether they meet the business requirements set before them. EII is a tool that can and should help in these areas. Don't forget unstructured search tools as well - EII vendors should partner with search vendors in order to have a wider grasp of "tagging" technology and summarization/scoring technology.

The best use of unstructured/semi-structured data is one that has a predefined business question or business case to answer.

Are you accessing unstructured/semi-structured data? I'd love to hear from you - what are your challenges or successes with what you've done?

Thanks,
Dan L

Posted by Dan Linstedt on March 6, 2006 6:05 AM
