Blog: Dan E. Linstedt

Automated Deterministic Contextual Coefficients

Interesting thoughts abound around the issue of determining context for a finite state model. If you've ever considered the metadata stored within a "data model", then you may be partial to today's discussion.

What I've got is an engine that starts with structural definitions. These definitions exist in a finite state: sometimes they openly declare hierarchies; other times they hide the associations and relationships in the naming conventions. However, one truism must hold: if a human cannot make sense of the structure, then it must be unusable. Therefore it is my belief that there is usually a finite state of context which can be automatically determined from the structure through a series of mathematical, statistical, and ontological approaches. These techniques are necessary for discovering the potential of unstructured data sets within DW2.0.

Ok, what's all this geek-speak? In other words, these techniques can _assist_ in placing an entire field / attribute into a hierarchical taxonomy, and further, can place the entity (table) into an over-arching ontology.

So why do we spend all this time deciphering data models to "understand how our source systems store information" when we should be spending time deciding how to apply that information to a better, more useful context? Isn't there some mathematical way to represent the language of attributes and entities, to which we can tie common meanings and definitions?

What does this mean? In other words, there is inherent meaning in the design (even if the design is encrypted) that makes sense of, or provides context to, the data _at the grain at which it stands_ within the data model.

For example, in English we read left to right. Typically things on the "left side" carry more importance than things on the "right side".
In the case of formulas, things on the "left side" are where the final answers are put, while things on the "right side" define how we get there (the computation).

Ok, so we have a field name:

CAP_EXP_TOTAL

What can we glean from this? Each abbreviation has its own meaning. The "total" component may define a function, and the context of the data might only be derived by looking at the taxonomy (table name) that encapsulates this information. If the table is a SALES table, then the total might be computed one way; if the table is a FINANCIAL table, then it may be computed differently. We can verify this by profiling the information housed within the fields.

I digress... What I'm discussing here is the notion of gaining partial context from a semantic ontology layer, partially built from the model and partially backed by a "lookup" on definitions and on the larger ontologies that each individual word might be housed in (from WordNet, for example).

Now, if I had a field called:

TYP_CD

we might not be able to discern what "type" of "type code" this thing is without looking at the enclosing entity. On the other hand, TYP_CD might actually appear in a name like this:

HULL_TYP_CD

Names like these share the general context of "type code". The fact that it is a "type code" is less important than what "type" of code the data is, and what grain the data lives in. An abbreviation such as "TYP_CD" can then be generalized to a shared ontology, regardless of the industry-specific model. It can also be represented by a finite set of variants such as "TYPE_CODE, TYP_CODE, TP_CD, TYPE_CD" and so on. The shorter the abbreviation, the lower an engine's confidence in determining actual matches without looking at the data.

Ok, so what does all this mean? We can achieve partial deterministic finite context for the base definition / storage of the data, and for the actual congruence of the data set across multiple data models.
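To make the abbreviation idea a little more concrete, here is a minimal sketch in Python. It is not the engine discussed in this post. The lookup table, the expansions chosen (CAP as CAPITAL, EXP as EXPENDITURE), the function names, and the confidence rule (letters kept divided by letters in the full word, taking the weakest token) are all assumptions made purely for illustration.

```python
# Hypothetical abbreviation lookup. The expansions are illustrative guesses,
# not definitions taken from any real data model or from WordNet.
ABBREVIATIONS = {
    "CAP": "CAPITAL",
    "EXP": "EXPENDITURE",
    "TYP": "TYPE",
    "TP": "TYPE",
    "CD": "CODE",
}

# Words that are already fully spelled out (also an assumed list).
FULL_WORDS = set(ABBREVIATIONS.values()) | {"TOTAL", "HULL"}


def expand_token(token):
    """Return (expanded_word, confidence) for one part of a field name."""
    token = token.upper()
    if token in FULL_WORDS:
        return token, 1.0
    if token in ABBREVIATIONS:
        word = ABBREVIATIONS[token]
        # Assumed heuristic: the fewer letters kept, the lower the confidence.
        return word, len(token) / len(word)
    # Unknown token: keep it as written, but claim no confidence.
    return token, 0.0


def canonicalize(field_name):
    """Expand each underscore-separated token of a field name.

    The overall confidence is that of the weakest token.
    """
    parts = [expand_token(t) for t in field_name.split("_") if t]
    canonical = "_".join(word for word, _ in parts)
    confidence = min((conf for _, conf in parts), default=0.0)
    return canonical, confidence


def describe(table_name, field_name):
    """Attach the enclosing entity, since the table supplies part of the context."""
    canonical, confidence = canonicalize(field_name)
    return {
        "entity": table_name.upper(),
        "attribute": canonical,
        "confidence": confidence,
    }


if __name__ == "__main__":
    for name in ("TYPE_CODE", "TYP_CODE", "TP_CD", "TYPE_CD", "HULL_TYP_CD"):
        print(name, canonicalize(name))
    print(describe("SALES", "CAP_EXP_TOTAL"))
    print(describe("FINANCIAL", "CAP_EXP_TOTAL"))
```

With this toy table, TYPE_CODE, TYP_CODE, TP_CD and TYPE_CD all collapse to the same canonical name, and the shorter forms come back with lower confidence. The same CAP_EXP_TOTAL attribute is reported under SALES and under FINANCIAL as two separate entity/attribute pairs, which is where profiling the actual data would have to take over.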
This kind of partial context means we can provide 60% to 80% commonality across data sets and data models around the world; it means potential standardization of grain and semantic layers.

If you have a sample of a data model you'd like me to analyze in this part of the blog, I'd be happy to give you my thoughts on the breakdown and semantic meanings. We'll see how well I do without knowledge of your business.

Cheers,