Blog: Dan E. Linstedt

Hidden in the un-structured information...

Welcome again. Unstructured data is a hard thing to grasp, let alone to process; but if we (businesses) are going after it, then we MUST have a reason. That reason? There must be value in the information hidden in the unstructured layers. After all, what is "unstructured" data anyway? I think free-form text is still semi-structured, and so are images, e-mails, Word docs, and other such elements. They are all structured to some degree; otherwise programs would not be able to display the documents, search the images, allow alterations, or perform matches.

I think what we should be focusing on in the Data Warehousing / Data Integration industry is how best to leverage the programs and algorithms already built for "unstructured information". Think about it: for images there are all kinds of image-processing programs for matching, alteration, consolidation, overlay, resizing, colorization, and so on. For drawings there are CAD programs, and element tags at the end or in the middle of the image that explain all the components. For chemical images there are sets of commands and tags that explain how to build a rotating 3D visual of the chemical elements and their associated parts. For Word docs and other documents there are "parsing and processing programs" like Microsoft Word, KDE KOffice (open source), StarOffice, and so on. For e-mails there are many different programs, but most e-mail traffic can actually be "sniffed" off TCP/IP packets without much damage to the content (if any today). Given this definition, the question I truly have is: WHAT IS unstructured data?
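To make the "structured to some degree" point concrete, here is a minimal sketch (not a production loader) that uses Python's standard `email` package to turn a raw e-mail into something that looks a lot like a table row. The sample message, the addresses, and the `email_to_row` helper are all made up for illustration.

```python
from email import message_from_string
from email.utils import parseaddr

# A made-up message; the headers are the "structure" hiding in the text.
RAW_EMAIL = """\
From: Pat Example <pat@example.com>
To: plant-planning@example.com
Subject: Drawing 1138 annotation change

The tolerance note on drawing 1138 changed; please review the plan.
"""


def email_to_row(raw: str) -> dict:
    """Hypothetical helper: flatten a simple, non-multipart e-mail into a dict."""
    msg = message_from_string(raw)
    sender_name, sender_addr = parseaddr(msg["From"])
    return {
        "sender_name": sender_name,
        "sender_addr": sender_addr,
        "recipient": msg["To"],
        "subject": msg["Subject"],
        # Only the body is free-form text; everything above is already typed.
        "body": msg.get_payload().strip(),
    }


row = email_to_row(RAW_EMAIL)
print(row)
```

Only the `body` field is really free-form here; the rest maps straight onto columns. None of this settles what "unstructured data" really means.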
I'm not so sure it's such a good term to use, but let's just accept (for the purposes of this entry) that unstructured data is everything that isn't (easily) defined by a standard RDBMS table structure, without BLOBs and CLOBs of course. Let's pretend for a minute that everything stored in a BLOB or CLOB is considered "unstructured", and then return to the question above.

Ten years ago (or more) I worked as an employee of a government manufacturing corporation: big money, big contracts, compliance, and unstructured data. Our manufacturing plan was filled with unstructured data. At that time we needed (as part of our effort) to integrate parts drawings, and to look for text within the CAD drawings, in other words the annotations on specific parts drawings, to figure out what impact it had on the plan. Back then the CAD images were just that, CAD images, and picking the text out wasn't as simple as "looking for the text attached to the image". We literally had to process vector graphic commands.

Why? What was hidden in our unstructured information? Why is this important for us? So how do we access this information easily today?

What you do with the information after you have discovered it should actually be predetermined by the business case, or the reason for purchasing and installing EII (Enterprise Information Integration) in the first place. As usual, the business needs to drive the need for IT to solve the problem of accessibility. Establish the value of "finding" and "using" the data in the unstructured world before you set out to implement.

What are some of EII's strengths today? What are some of the features that EII will need in the near future?

Remember, summarization, and what is done with that summarization of unstructured and semi-structured information, can often shed light on "how" these documents are utilized, or how they meet the business requirements set before them.
EII is a tool that can and should help in these areas. Don't forget unstructured search tools as well: EII should partner up with these vendors in order to have a wider grasp of "tagging" technology and summarization/scoring technology. The best use of unstructured/semi-structured data is the one that has a predefined business question or business case to answer.

Are you accessing unstructured/semi-structured data? I'd love to hear from you. What are your challenges or successes with what you've done?

Thanks,