OK, so maybe a piece of software can't really party - but we can! :) Claudia just posted a blog entry on the need to bring semi-structured and unstructured data into the enterprise warehouse. Bill Inmon has an unstructured/semi-structured data retrieval and visualization tool, and we see more information being pushed under the compliance umbrella.
That leaves us asking many questions, like: Do I need to monitor all e-mails? How do I decide what's important and what's not in my "sea of word-docs"? And what impact does all this have on my EDW?
It will take a long time to answer all those questions, but one thing in the EII world that has been overlooked is its ability to access, reference, and integrate semi-structured and unstructured data.
Someone somewhere once said: "only 20% of the world's data lives in the structured realm, the other 80% lives in semi-structured and unstructured content." Well, if this is really true, and we've seen ROIs for EDWs as high as 400%, then what do you think the ROI could be when integrating the other 80% of our business? It certainly should raise some eyebrows.
Now I'm not suggesting that EII replace ETL. In fact, there are some misunderstandings out there about ETL, one of which says that ETL handles only batch and is used only for historical data. This simply is not true. Alright, 80% of the time it may be true, but there are times when an Active Data Warehouse has been built and ETL runs on a 3- or 5-minute refresh increment. I've also seen ETL used with queuing mechanisms for real-time transformation (by no means an easy task). Another customer uses ETL to synchronize all their source systems across the enterprise, and they don't even have a warehouse.
But ETL also works only with STRUCTURED data. Making ETL "fit" a real-time integration paradigm is like forcing a round peg into a square hole: it is challenging and costly, and it increases complexity.
Now this is where EII really begins to shine: EII can make it much easier to integrate real-time data - not to mention unstructured and semi-structured data. Let's focus on the following two components: e-mail and documents. What if the metadata for my warehouse was stored in an "appendix" or glossary of terms in a word-doc? What if I had answered, by e-mail, 4 or 5 key questions about how certain elements are computed?
Would it be helpful to a) know that this information exists, b) have it catalogued in the warehouse, and c) be able to integrate it within my BI reporting solution as "pop-overs" or pop-ups? This is all fine and dandy, but by now the old-timer ETL jockeys are saying: I can write Perl to conform this stuff to structured data and load it in - why do I need EII?
Well, here's the case: What if over the following two minutes I answer two more questions while a training class is in session? EII can easily detect the new e-mails and provide the information in real time to the training class. If I then add a word-doc containing FAQs to the central library, the class can make use of that information as well (immediately).
Granted, this is just one small case of solving a very specific problem - EII can solve many more problems like this, and much larger in scope, but it demonstrates a differentiator between EII and ETL.
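To make the training-class case concrete, here is a minimal sketch in Python. It is not how any particular EII or ETL product works; the class names `BatchCopy` and `LiveView`, the `mailbox` list, and the keyword matching are all assumptions made up for illustration. The only point is the difference between querying a copy loaded earlier and querying the source at the moment the question is asked.

```python
# Hypothetical illustration only: a plain list stands in for the e-mail store.

class BatchCopy:
    """ETL-style access: copy the source once, then query the copy."""

    def __init__(self, source):
        self.rows = list(source)  # snapshot taken at load time

    def query(self, keyword):
        return [r for r in self.rows if keyword.lower() in r.lower()]


class LiveView:
    """EII-style access: keep a reference and read the source at query time."""

    def __init__(self, source):
        self.source = source  # no copy is made

    def query(self, keyword):
        return [r for r in self.source if keyword.lower() in r.lower()]


mailbox = ["Q: How is net margin computed? A: revenue minus cost, over revenue."]
batch = BatchCopy(mailbox)
live = LiveView(mailbox)

# Two minutes later, while the class is in session, another answer goes out.
mailbox.append("Q: Is net margin shown before tax? A: Yes, before tax.")

stale = batch.query("net margin")  # sees only what was loaded earlier
fresh = live.query("net margin")   # sees the new e-mail immediately
```

In fairness to ETL, the copy could be refreshed on a short schedule; the sketch only shows why the on-demand approach needs no reload step at all.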
Utilizing EII to access unstructured data should drive up ROI on integration projects at a much faster rate. Besides which, the ETL jockeys could use EII to help "discover" information about their integration projects - it may even help speed up the build-out process for EDW efforts.
Thoughts?
Dan L
Posted October 5, 2005 10:40 AM

Dan,
Among our EII customers and prospects, we repeatedly see three kinds of unstructured/semi-structured information: Word docs, XML docs, and search.
Word docs are useful for the analysis and opinion they contain. We have a large customer that uses extracts from Word docs inline with data from their EDW and ODS in a portal.
A good example of a tricky XML doc is a financial statement in XBRL. These usually contain over 2000 data elements per statement, and depending on what you are trying to analyze, you might only need a handful. So a more dynamic, query-based approach makes sense. We announced a partnership with Business Objects and EDGAR Online to do just this.
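As a rough illustration of the query-based idea, the Python sketch below pulls only the requested elements out of an XML statement instead of loading every element. The sample document is a made-up, heavily simplified stand-in for XBRL (real statements are far larger and use the official taxonomy namespaces), and the `select_elements` helper and the `http://example.com/us-gaap` namespace URI are assumptions for the example, not part of any product.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for an XBRL financial statement.
SAMPLE = """<xbrl xmlns:us-gaap="http://example.com/us-gaap">
  <us-gaap:Revenues contextRef="FY2004" unitRef="USD">1200</us-gaap:Revenues>
  <us-gaap:NetIncomeLoss contextRef="FY2004" unitRef="USD">150</us-gaap:NetIncomeLoss>
  <us-gaap:Assets contextRef="FY2004" unitRef="USD">5000</us-gaap:Assets>
  <us-gaap:Liabilities contextRef="FY2004" unitRef="USD">3200</us-gaap:Liabilities>
</xbrl>"""

# Placeholder namespace mapping matching the sample above.
NS = {"us-gaap": "http://example.com/us-gaap"}


def select_elements(xml_text, wanted, context):
    """Return only the named elements reported for the given context."""
    root = ET.fromstring(xml_text)
    result = {}
    for name in wanted:
        for node in root.findall(f"us-gaap:{name}", NS):
            if node.get("contextRef") == context:
                result[name] = float(node.text)
    return result


# Ask for a handful of elements; everything else in the statement is ignored.
picked = select_elements(SAMPLE, ["Revenues", "NetIncomeLoss"], "FY2004")
```

The list of wanted elements can change from one analysis to the next without anything being reloaded, which is the appeal of the dynamic approach.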
For search, we typically partner with a search company or access Google via a Web Service. One nice thing you can do dynamically with EII is use the result of one query as the parameter for the search: get me the news stories on today's highest-yield customers, because I want to see if they have something in common that the numbers alone aren't telling me.
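A small Python sketch of that chaining pattern follows. Everything in it is assumed for illustration: an in-memory SQLite table stands in for the structured source, a list of headlines stands in for the search service, and the `customers` table, `top_customers`, `search_news`, and `news_for_top_customers` are hypothetical names, not a real search API.

```python
import sqlite3


def top_customers(conn, limit):
    """Structured query: names of the highest-yield customers."""
    rows = conn.execute(
        "SELECT name FROM customers ORDER BY yield_pct DESC, name LIMIT ?",
        (limit,),
    ).fetchall()
    return [row[0] for row in rows]


def search_news(stories, term):
    """Stand-in for a search service: case-insensitive substring match."""
    term = term.lower()
    return [s for s in stories if term in s.lower()]


def news_for_top_customers(conn, stories, limit=2):
    """Feed each result of the first query into the search as its parameter."""
    return {name: search_news(stories, name) for name in top_customers(conn, limit)}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, yield_pct REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Acme", 9.5), ("Globex", 7.1), ("Initech", 3.2)],
)

STORIES = [
    "Acme opens new plant in Ohio",
    "Initech announces layoffs",
    "Globex and Acme sign supply deal",
]

report = news_for_top_customers(conn, STORIES)
```

Here the shared supply-deal story shows up under both top customers, which is the kind of common thread the numbers alone would not reveal.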
I'll take a note and go into more depth on these in an upcoming blog.