
Blog: Dan E. Linstedt


Defining Unstructured Data & DW2.0

My last post discussed the notion that unstructured data may be as much as 80% of the data that we in IT will, or should, begin to deal with. One reader asked me to expand on what I include in unstructured data. This entry discusses structured, semi-structured, and unstructured data as I see them. As usual, this pertains to business knowledge, and is a huge part of DW2.0. It also is (or will become) a huge part of changing IT from a cost center into a profit center. Why? Because if we can integrate unstructured information and glean the knowledge from it (determine contextual linkages), we can better understand where our business holes are.

There are three terms being bandied about in our DW2.0 world: Structured, Semi-Structured, and Unstructured data. Let's take a look at defining these terms and what they mean to us going forward.

Structured Data
Data that sits in a data store and is defined by a catalogue (table definitions): something accessible via SQL, or described by data models, COBOL copybooks, or object definitions. Data in rows and columns. Furthermore, this data is contextualized by its heading (field name), and possibly defined in relation to other "fields". It can also be processed in a simple manner: summed, aggregated, and so on. What this data is NOT: images, blobs, binary fields, free-form documents, and so on.
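To make "processed in a simple manner" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The sales table, its column names, and the sample rows are all made up for illustration; the point is only that the field names give the values their context, so summing them takes one line of SQL.

```python
import sqlite3

def total_by_region(rows):
    """Load (region, amount) rows into an in-memory table and sum per region."""
    conn = sqlite3.connect(":memory:")
    # The catalogue: column names and types give every value its context.
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    ).fetchall()
    conn.close()
    return result

# Hypothetical sample data, not taken from any real system.
totals = total_by_region([("east", 100.0), ("west", 50.0), ("east", 25.0)])
print(totals)
```

Nothing comparable exists for an image or a free-form document: there is no column to sum.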

Semi-Structured Data
Semi-structured data houses structure alongside free-form elements. E-mails, for instance, have structure and context for specific elements in the header, but the body is free-form text. Semi-structured data comes in many forms, and whether a given piece of data counts as semi-structured or unstructured depends on what you are looking at. For instance, semi-structured data for a firewall might be TCP/IP packets, where the firewall cares about the contents of each individual packet, along with a string of packets from the same IP address to establish a pattern, and so on.
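The e-mail case can be sketched with Python's standard email package. The message below is invented, and split_email is just a name I picked for illustration. The header fields come back as named, contextualized values, while the body comes back as one undifferentiated string.

```python
from email import message_from_string

# A made-up message: structured header lines, a blank line, then free-form text.
RAW = """From: alice@example.com
To: bob@example.com
Subject: Q3 forecast

Hi Bob, the forecast looks weak in the west region. Can we talk tomorrow?
"""

def split_email(raw):
    """Return (headers, body): the structured part and the free-form part."""
    msg = message_from_string(raw)
    # Header fields are addressable by name, much like columns in a table.
    headers = {name: msg[name] for name in ("From", "To", "Subject")}
    # For a simple, non-multipart message the payload is a plain string.
    body = msg.get_payload()
    return headers, body

headers, body = split_email(RAW)
print(headers["Subject"])
print(body)
```

Everything interesting to the business (which region, what is wrong, what happens next) sits in the body, which the parser cannot interpret for us.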

Unstructured Data
Unstructured data is typically everything that is neither structured nor semi-structured: for instance, images, this blog entry, the content of web documents, standard documents, movies, audio, and so on.

What's the big difference? Why the hoopla? I thought Word Docs were structured!
Well, it all depends on your perspective. If the application is MS-Word, then the document itself is in fact structured; however, the CONTENT is not. The same goes for a web page: the tags are structured, as are CSS elements, XML, HTML, and so on, but the CONTENT is not.
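The web-page case can be sketched with Python's standard html.parser module. The class name, the helper function, and the sample page are mine, for illustration only. The parser hands back the tags (the encapsulating structure) and the text (the content) separately.

```python
from html.parser import HTMLParser

class StructureAndContent(HTMLParser):
    """Collect tag names (structure) and text (content) in separate lists."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        # Skip whitespace-only runs between tags.
        if data.strip():
            self.text.append(data.strip())

def separate(html):
    """Return (list of tag names, concatenated text content)."""
    parser = StructureAndContent()
    parser.feed(html)
    parser.close()
    return parser.tags, " ".join(parser.text)

# Hypothetical sample page.
tags, content = separate(
    "<html><body><h1>Notes</h1><p>Sales fell in <b>March</b>.</p></body></html>"
)
print(tags)
print(content)
```

The tag list is easy to reason about mechanically. The text is where the meaning is, and at this point it is still just a string.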

Free-form text (content) is NOT structured, until you look at it as a document with sentences and punctuation. Then, from a grammatical standpoint, it is structured at a lower level of grain. But do you care? This is the big question. Just like we care about the grain of structured data, we should care about the grain of unstructured data.

We need to separate the terms. In an unstructured or semi-structured world, we need to make a choice: do we care about the "encapsulating structure", about the content, or both? The knowledge is buried in the content, and the value comes from doing something meaningful with that content.

Why?
Because unstructured, semi-structured, and structured data are "one-and-the-same" when we talk about the encapsulating structure. All Word docs, for instance, have markers, metadata, and processing instructions for Word to follow (layout, borders, size, color, font, etc.). All emails have a standardized "structure", all images have specific processing instructions for standardized rendering engines, and the same holds for all audio, all blobs, and so on.

But when we talk about CONTENT, the playing field changes. Not all content is "the same". Consider processing a series of images and detecting that one is a face, one is a human, one is a tree, one is an ocean, and so on. Determining WHAT the image is, and how it relates to other data based on WHAT it is, is where the knowledge lies. That's where the unstructured data processing lives.

Content derivation, assimilation, and integration are part of the story. Once the content can be parsed, then hopefully basic outliers of context (important points) can be derived. Think of a search engine looking for key terms, but take it further: look for key terms that make sense or have relevance. Then go one step further still: not only find what is relevant, but actually tie together what's duplicate and what's not, and learn from the "elimination" of search results that the context is not relevant for those particular search terms.
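A very crude sketch of the first two steps (key terms, then tying together what's duplicate) might look like the following. To be clear, this is my own toy stand-in, not a real contextual engine: the stop-word list, the Jaccard overlap measure, the 0.8 threshold, and all function names are assumptions chosen for illustration.

```python
import re

# A tiny, hypothetical stop-word list; a real system would use a far larger one.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "for", "on"}

def term_set(text):
    """Lower-case the text and keep the words that are not stop words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def jaccard(a, b):
    """Overlap of key terms between two texts: 0.0 (disjoint) to 1.0 (same)."""
    sa, sb = term_set(a), term_set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def drop_near_duplicates(docs, threshold=0.8):
    """Keep a document only if it is not too similar to one already kept."""
    kept = []
    for doc in docs:
        if all(jaccard(doc, k) < threshold for k in kept):
            kept.append(doc)
    return kept

docs = [
    "Sales fell in the west region in March.",
    "In March, sales fell in the west region.",
    "The new product launch is planned for June.",
]
unique = drop_near_duplicates(docs)
print(unique)
```

The first two sentences use the same key terms in a different order, so the second is treated as a duplicate. The harder part, learning from eliminated results which context is not relevant, is beyond what a word-overlap measure like this can do.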

This is just one example. Anyhow, all of this relates to DW2.0 and the stack within. Unstructured, semi-structured, and structured data are NOT the same in a contextual sense, but they are the same in a structural-encapsulation sense. In DW2.0, we must integrate the contextual information (that is, mine it and link it together) in order to increase our awareness of what's going on in both the external and internal worlds of the corporation.

In order to make money, increase profits in IT, and actually provide more value back to the business, we as IT professionals MUST undertake automation and data mining of unstructured information, along with contextual integration, as a step forward. Otherwise we will lose sight of valuable information (particularly competitive information).

In the next blog entry I'll talk a little more about approaching IT automation, and how to integrate unstructured information into your enterprise from a DW2.0 perspective.

Please don't hesitate to comment, or ask questions.

Thank you,
Dan Linstedt
Get your Masters of Science in Business Intelligence at: http://www.COBICC.com

  Posted by Dan Linstedt on March 21, 2007 4:38 AM |
