Business Intelligence Network

Blog: Dan E. Linstedt

Automated Deterministic Contextual Coefficients

Interesting thoughts abound around the issues of determining context for a finite state model. If you've ever considered the metadata stored within a "data model" then you may be partial to the discussion here today. What I've got is an engine that starts with structural definitions. These definitions exist in a finite state: sometimes they openly declare hierarchies; other times they hide the associations and relationships in the naming conventions. However, one truism must hold: if a human cannot make "sense" of the structure, then it must be unusable. Therefore it is my belief that in most cases there is a finite state of context which can be automatically determined from the structure through a series of mathematical, statistical, and ontological approaches.

These techniques are necessary in discovering the potential of unstructured data sets within DW2.0.

Ok, what's all this geek-speak?
What I'm saying is this: Data Models are constructed from a fundamental need to house, organize, and store information in mass quantities. Data Models (much like file folders) have levels of grain both within and across multiple structures. The elements within a data model are typically stated according to some "finite" naming convention. Things like abbreviations, annotations, squashed or shortened definitions, relationships (referential integrity) and so on determine a partial definition of the element.

In other words, they can _assist_ in placing the entire field / attribute into a hierarchical taxonomy. Further, they can place the entity (table) into an over-arching ontology. So why do we spend all this time deciphering data models to "understand how our source systems store information" when we should be spending time deciding how to apply that information to a better, more useful context?

Isn't there some mathematical way to represent the language of attributes and entities to which we can tie common meanings and definitions?
Turns out there is... free resources (such as the WordNet lexical database) can help with language definition and raw metadata understanding from around the world. I believe that by tying this information together with a formulated understanding of a given data model, we can begin to understand "what we really represent" inside our capture systems.

What does this mean?
Well, what I'm referring to is the science of taking what is described, making sense of the context through ontologies and taxonomies, then applying those definitions against "what the business thinks is happening." This produces a significant GAP analysis. It can also be bounced against what data is "stored" in the system, to see whether or not the information matches up with the basic goals of the taxonomy.

In other words, there is inherent meaning to the design (even if the design is encrypted) that makes sense of, or provides context to, the data _at the grain at which it stands_ within the data model. For example, in English we read left to right. Typically things on the "left side" carry more importance than things on the "right side". In the case of formulas, things on the "left side" are where the final answers are put, while things on the "right side" define how we get there (computation).

Ok, so we have a field name:

CAP_EXP_TOTAL

What can we glean from this?
If we assume that the ontology for this definition is in the line of finance, we might end-up expanding the abbreviations to:
CAPITAL
EXPENDITURE
TOTAL

Each abbreviation has a subsequent meaning. The "total" component may be a definition of a function, and the context of the data might only be derived when looking at the taxonomy (table name) that encapsulates this information. If the table is a SALES table, then that total might be computed one way; if the table is a FINANCIAL table, then it may be computed differently. We can check this by profiling the information housed within the fields.
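As a minimal sketch of that expansion step (not the engine itself), the snippet below splits a field name on underscores and expands each token from a per-domain dictionary. The dictionaries, the "shipping" domain, and the function name expand_field_name are all hypothetical, made up purely for illustration; a real engine would draw these from the model and from an external lexical source.

```python
# Hypothetical abbreviation dictionaries, one per domain (ontology).
# The entries are illustrative assumptions, not taken from any real model.
ABBREVIATIONS = {
    "finance": {"CAP": "CAPITAL", "EXP": "EXPENDITURE", "TOTAL": "TOTAL"},
    "shipping": {"CAP": "CAPACITY", "EXP": "EXPORT", "TOTAL": "TOTAL"},
}


def expand_field_name(field_name, domain):
    """Expand each underscore-separated token using the domain's dictionary.

    Tokens that are not in the dictionary are returned unchanged.
    """
    lookup = ABBREVIATIONS[domain]
    return [lookup.get(token, token) for token in field_name.upper().split("_")]


print(expand_field_name("CAP_EXP_TOTAL", "finance"))
print(expand_field_name("CAP_EXP_TOTAL", "shipping"))
```

The same field name expands differently depending on which domain is assumed, which is why the enclosing table and the profiled data still matter.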

I digress... What I'm discussing here is the notion of gaining partial context from a semantic ontology layer that is partially built from the model and partially backed by a "lookup" of definitions, including the larger ontology (from WordNet, for example) in which each individual word is housed.

Now, if I had a field called:
TYP_CD

We might not be able to discern what "type" of "type code" this thing is without looking at the enclosing entity. On the other hand, the field might actually be named with a qualifier, like this:

HULL_TYP_CD
TAIL_TYP_CD
CUST_TYP_CD
or even:
ACCT_TYPE_CODE

Each of these abbreviations shares the general context of "type code". The fact that it is a "type code" is less important than what "type" of code the data is, and what grain the data lives in. The resulting abbreviation, such as "TYP_CD", can then be generalized to a shared ontology, regardless of the industry-specific model. It can also be represented by a finite set of definitions like "TYPE_CODE, TYP_CODE, TP_CD, TYPE_CD" and so on. The shorter the abbreviation, the less confidently an engine can determine actual matches without looking at the data.
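A rough sketch of that generalization, under stated assumptions: the variant table below is hypothetical, and the length-ratio "confidence" is only a stand-in heuristic for the idea that shorter abbreviations are harder to match, not a measure used by any actual engine. The function names classify and abbreviation_confidence are likewise made up for this example.

```python
# Hypothetical variant table: each canonical word with spellings seen in models.
TOKEN_VARIANTS = {
    "TYPE": {"TYPE", "TYP", "TP"},
    "CODE": {"CODE", "CD"},
}
# Reverse lookup: spelling -> canonical word.
CANONICAL = {
    variant: word
    for word, variants in TOKEN_VARIANTS.items()
    for variant in variants
}


def classify(field_name):
    """Return (shared_concept, qualifier) for a field name.

    A name ending in some spelling of TYPE + CODE maps to the shared concept
    "TYPE_CODE"; whatever precedes it (HULL, CUST, ...) is the qualifier.
    """
    tokens = [CANONICAL.get(t, t) for t in field_name.upper().split("_")]
    if tokens[-2:] == ["TYPE", "CODE"]:
        return ("TYPE_CODE", "_".join(tokens[:-2]) or None)
    return (None, "_".join(tokens))


def abbreviation_confidence(token):
    """Illustrative heuristic only: fraction of the full word that was kept."""
    word = CANONICAL.get(token.upper())
    return len(token) / len(word) if word else 0.0


for name in ["HULL_TYP_CD", "TAIL_TYP_CD", "CUST_TYP_CD", "ACCT_TYPE_CODE", "TYP_CD"]:
    print(name, classify(name))
```

A bare TYP_CD yields no qualifier, which is exactly the case where the enclosing entity has to supply the missing context.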

Ok, so what does all this mean?
This means we can automatically examine the data, the model, the ontologies, and glean or construct: partial meaning, grain, global or local (shared definitions/specific definitions), overlap, and finally: we can mathematically optimize the STRUCTURAL MODEL we have in order to achieve a better result, a more common result where the metadata of the business shifts to the "applied function of the data" at run-time.

We can achieve partial deterministic finite context for the base definition / storage of the data, and the actual congruence of the data set across multiple data models. It means we can find 60% to 80% commonality across data sets and data models around the world; it means potential standardization of grain and semantic layers.
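One simple way to put a number on "commonality" between two models is to normalize the field names and measure how many normalized names they share. The sketch below uses Jaccard overlap, which is my choice for illustration and not necessarily how such an engine would score it; the synonym table and the two sample models are hypothetical, and the result says nothing about the 60% to 80% figure above.

```python
# Hypothetical synonym table mapping abbreviations to canonical words.
SYNONYMS = {
    "TYP": "TYPE", "TP": "TYPE", "CD": "CODE",
    "CUST": "CUSTOMER", "ACCT": "ACCOUNT",
}


def normalize(field_name):
    """Rewrite each underscore-separated token to its canonical word."""
    return "_".join(SYNONYMS.get(t, t) for t in field_name.upper().split("_"))


def commonality(model_a, model_b):
    """Jaccard overlap of the normalized field names of two models."""
    a = {normalize(name) for name in model_a}
    b = {normalize(name) for name in model_b}
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Two made-up models that name the same things differently.
model_one = ["CUST_TYP_CD", "ACCT_TYPE_CODE", "HULL_TYP_CD"]
model_two = ["CUSTOMER_TYPE_CODE", "ACCT_TP_CD", "TAIL_TYP_CD"]
print(commonality(model_one, model_two))
```

Comparing raw names, these two models share nothing; after normalization, two of the four distinct fields line up.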

If you have a sample set of a data model you'd like me to analyze in this part of the blog, I'd be happy to give you my thoughts on the break-down and semantic meanings. We'll see how well I do without knowledge of your business.

Cheers,
Daniel Linstedt

Posted by Dan Linstedt on September 2, 2007 11:34 AM