Blog: Dan E. Linstedt

Defining Unstructured Data & DW2.0

My last post discussed the notion that unstructured data may be as much as 80% of the data that we in IT will (or should) begin to deal with. One reader asked me to expand on what I include in unstructured data. This entry discusses the types of structured, semi-structured, and unstructured data as I see them. As usual, this pertains to business knowledge, and it is a huge part of DW2.0. As it turns out, it also is (or will become) a huge part of changing IT from a cost center into a profit center. Why? Because if we can integrate unstructured information and glean the knowledge from it (determine contextual linkages), we can better understand where our business holes are.

Three terms are being bandied about in our DW2.0 world: structured, semi-structured, and unstructured data. Let's take a look at defining these terms and what they mean to us going forward.

Structured Data

Semi-Structured Data

Unstructured Data

What's the big difference? Why the hoopla? I thought Word Docs were structured!

Free-form text (content) is NOT structured, until you look at a document that has sentences and punctuation. Then, from a grammatical standpoint, it is structured at a lower level of grain. But do you care? This is the big question. Just as we care about the grain of structured data, we should care about the grain of unstructured data.

We need to separate the terms. In an unstructured or semi-structured world, we need to make a choice: do we care about the "encapsulating structure", about the content, or about both? The knowledge is buried in the content, and the value comes from doing something meaningful with that content. Why? But when we talk about CONTENT, the playing field changes.
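To make the "encapsulating structure" versus content distinction a little more concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than anything defined by DW2.0: the `Document` class, the `split_sentences` and `describe` helpers, the sample memo, and the deliberately naive regular-expression sentence splitter are all hypothetical. It only shows the same item viewed at two grains: the container and the sentences inside it.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Document:
    # "Encapsulating structure": facts about the container, not its meaning.
    name: str
    media_type: str
    # "Content": the free-form text carried inside the container.
    text: str


def split_sentences(text: str) -> List[str]:
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def describe(doc: Document) -> Dict[str, object]:
    """Report the document at two grains: the container and its sentences."""
    sentences = split_sentences(doc.text)
    return {
        "structure": {"name": doc.name, "media_type": doc.media_type},
        "sentence_count": len(sentences),
        "sentences": sentences,
    }


# Hypothetical sample data, invented for this illustration only.
doc = Document(
    name="memo.doc",
    media_type="application/msword",
    text="Sales fell in March. Why? A competitor cut prices.",
)
summary = describe(doc)
print(summary["structure"])
print(summary["sentence_count"])
```

The structure entry says nothing about what the memo means; only the sentence-level grain starts to expose content that could be mined or linked to other data. A real pipeline would need a far more robust way of splitting and interpreting text than this sketch.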
Not all content is "the same". When you process a series of images, for example, you detect that one is a face, one is a human, one is a tree, one is an ocean, and so on. Determining WHAT the image is is where the knowledge lies, and how it relates to other data based on WHAT it is: that's where unstructured data processing lives.

Content derivation, assimilation, and integration are part of the story. Once the content can be parsed, then hopefully basic outliers of context (important points) can be derived. Think of a search engine looking for key terms, but take it further: key terms that make sense or have relevance. Then go one step further still: not only find relevance, but actually tie together what's duplicate and what's not, and learn from the "elimination" of search results that the context is not relevant for those particular search terms. This is just one example.

Anyhow, all of this relates to DW2.0 and the stack within. Unstructured, semi-structured, and structured data are NOT the same in a contextual sense, but they are the same in a structural encapsulation sense. In DW2.0, we must integrate the contextual information (meaning: mine it and link it together) in order to increase our awareness of what's going on in both the external and internal worlds of the corporation.

In order to make money, increase profits in IT, and actually provide more business value back to the business, we MUST, as IT professionals, undertake automation and data mining of unstructured information, along with contextual integration, as a step forward, or we will lose sight of valuable (particularly competitive) information.

As always, in the next blog I'll talk a little more about approaching IT automation and how to integrate unstructured information into your enterprise from a DW2.0 perspective. Please don't hesitate to comment or ask questions.

Thank you,