Blog: Dan E. Linstedt

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, Unstructured Data, DW and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University where I participate on an academic advisory board for Masters Students in I.T. I can't wait to hear from you in the comments of my blog entries. Thank-you, and all the best; Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor of The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI / CMMi Level 5, and is the inventor of The Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, trained hundreds of industry professionals, and is the voice of Bill Inmon's blog on http://www.b-eye-network.com/blogs/linstedt/.

Recently in Thought Experiments Category

I've been discussing DDW for quite a while, and I've started discussing the nature of dynamic structure change. There are larger considerations out there that we need to think about before embarking down these paths. That said, there are some applications of architectural mining and dynamic structure change which I wish to discuss here. For those of you in the intelligence sectors of government, or in research and defense, this may be of interest (or not). For those of you in traditional DW / BI, the only benefit that dynamic structure change might bring is the ability to adapt faster (on the back end) to the dynamic changes of business. But then again, this technology is years away (for all we know ;)

Dynamic re-structuring of structured data, why would we do it? What is the interest? What are the benefits?

Well, if you're in the intelligence sector, or identity analytics, or defense research, then this may hold some serious value - and perhaps you are already performing these tasks. After all, DARPA began funding Nanotech and DNA computing experiments more than 10 years ago (at least as far as we can tell publicly). Enough said...

Anyhow, imagine a system beyond master data, where we have structures that house specific "images" of data at a specific point in time. We can then stack those images and slice by time, or... slice by association.

What do you mean, slice by association?
What I mean is: imagine for a minute that a slice in time combined with a business key, surrounded by specific descriptors, actually establishes a particular context. Now imagine that you have several hundred of these data points (multiple keys across the model); each is already imaged by "time", and quite possibly by multiple time frames. Finally, imagine that the business keys are essentially useless, except as hard mechanisms for people to identify the information.
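
To make the stacking idea a little more concrete, here is a minimal sketch. It assumes each "image" is simply a record holding a business key, an as-of date, and a dictionary of descriptors. The names `Image`, `slice_by_time`, and `slice_by_association` are hypothetical and exist only to illustrate the two ways of cutting the stack.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Image:
    """One point-in-time picture of the descriptors around a business key."""
    key: str
    as_of: date
    descriptors: dict


def slice_by_time(stack, when):
    """Return the latest image per key that was taken on or before `when`."""
    latest = {}
    for img in sorted(stack, key=lambda i: i.as_of):
        if img.as_of <= when:
            latest[img.key] = img  # later images overwrite earlier ones
    return latest


def slice_by_association(stack, name, value):
    """Return the keys whose images ever carried the descriptor name=value."""
    return {img.key for img in stack if img.descriptors.get(name) == value}


# Hypothetical sample data: two keys imaged at different times.
stack = [
    Image("CUST-1", date(2007, 1, 1), {"city": "Denver"}),
    Image("CUST-1", date(2007, 6, 1), {"city": "Boulder"}),
    Image("CUST-2", date(2007, 3, 1), {"city": "Denver"}),
]
```

Slicing by time answers "what did we know as of this date?", while slicing by association ignores the keys' own meaning and groups them by a shared descriptor, which is the direction the rest of this entry takes.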

Now you are beginning down the path of something called identity analytics. Surround those keys with the notion of context. Of course, I'm using the term loosely: a context is one view of information at a particular point in time, and it is how you "rotate" the information to meet the needs of the current end user.

So you're saying "no relationships"?
Yes, that is one of the things I'm saying. Now apply data mining across the information sets and look for previously unknown patterns. Deeper than that, look for abstracted patterns of correlated data, where outliers aggregate in coincidental time frames. This is then presented to an end user in a visual three-dimensional graph format.

We apply color to "hot", "cold", and lukewarm correlations, and the user applies the human thought process of "interest". By focusing in on the points of interest and applying human logic, we could theoretically surf billions of contextual relationships that would otherwise go unnoticed.

Now, the human interaction establishes (interactively) the points of interest, or the relationships that associate the information with other points of information. Once this is done, a new set of data mining algorithms is run. These algorithms produce a specific answer, and test correlation of information against a more focused lens. This cycle can be run over and over again until the human decides that the relationship is of interest, and NOW can apply information relationships dynamically.

Once this relationship "falls out of interest" it is removed, in favor of a new relationship. In essence, the model becomes a slowly evolving model with human intervention. It's possible that after certain relationships have been identified, the data mining algorithms can be "tuned" to self-modify parts of those relationships.

Well, all of this is just a thought experiment - the only part which may not necessarily be achievable today is the application of these changes to the queries, and load routines. Certainly without human interaction, zooming in to points of interest becomes a difficult task.

Identity analytics plays a role like this, in identifying context from information - then relating different "identities" as associated elements. But that's for another day.

I hope you found this entry interesting; I'd love to hear your thoughts.

Thanks,
Dan Linstedt
DanL@GeneseeAcademy.com


Posted October 30, 2007 2:50 PM

I sat down with my good friend Jeff Jonas yesterday and discussed the nature and notion of contextual processing. Jeff is a phenomenal individual, and much smarter than I ever hope to be, but all that aside, we had a wonderful conversation about the nature of processing streaming data (one piece at a time, or possibly multiple pieces in parallel, but separated) and how to focus the notions of context.

How is this related to B.I.?
It has everything to do with Business Intelligence, and with how we "experience" and use our data sets and the patterns within them to make sense of our business, especially in an Operational B.I. world.

Processing the context on a streaming basis (as Jeff says) requires the ability to "change" all that we know (perception) at run-time based on new facts arriving on the stream. His statements went a little like this:

1) Imagine we think our friend XYZ is a good person. We just met this person 3 days ago, so we don't know much about them, but they've been nice to us - so our current perception of this individual is: K, U, I, O, T - and so on. We've hung out with them, so we have a whole host of experiences to draw from (mostly fun).
2) Now, 3 days later we find out from another very good friend, someone we've trusted for over 25 years, that this person has done something horrible in the past...

At that instant, considering our relationship to our very good friend, all that we know about person XYZ (perceptively) changes; usually very quickly.

Now, this isn't so bad if we are dealing with one piece of information, and a very small series of memories that we are focused on, but imagine now: trying to do this at 10,000 transactions per second in a non-sequential order of arrival of facts, and then trying to affect data sitting within 100 billion rows in our database...

This brings me to my discussion. From here Jeff and I began discussing HOW this processing needed to take place, and it reminded me of some of the conversations I'm having here at Teradata Partners conference this week.

The questions on the table are:
1) How should the system determine the assigned context for a given fact? Well, we have to let go of the word "context" and from a systems perspective we have to work with the notion that the data has a strong correlation to a particular STACK or SET of facts/history or historical knowledge.
2) Once a perspective has been established for that incoming fact, what IMPACT does it or should it have against all the target data, or patterns that are already known? For instance, suppose an area code changes from 720 to 750 (Jeff's example) - what do you need to do to change ALL of the existing phone numbers? Inserting brand new rows isn't always the answer; it would cause too much data change. Updating existing information also won't work; it too would take too long. REMEMBER: 10,000 transactions per second means we have to process this information and execute on the history in millisecond response times.
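
One way to picture the area-code problem is a rule applied at read time, so that the stored history is never rewritten. This is purely an illustrative assumption, not something Jeff proposed and not the only option; `AreaCodeLens` and its methods are hypothetical names.

```python
class AreaCodeLens:
    """Read-time translation rules laid over unchanged phone-number history."""

    def __init__(self):
        self.rules = {}  # old area code -> new area code

    def add_rule(self, old, new):
        # Recording a rule touches one dictionary entry, not the stored rows.
        self.rules[old] = new

    def view(self, phone):
        """Return the phone number as it should be seen today."""
        area, rest = phone[:3], phone[3:]
        return self.rules.get(area, area) + rest


# Hypothetical stored rows; these are never modified.
history = ["7205550100", "3035550101", "7205550102"]

lens = AreaCodeLens()
lens.add_rule("720", "750")
current = [lens.view(p) for p in history]
```

The cost of the change moves from write time (billions of rows) to read time (one lookup per row viewed). A real area-code change would likely need finer-grained rules than the three-digit prefix alone.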

Jeff and I began to discuss the notion of a LENS, through which focus on a particular pattern could be achieved. What's important here is the FOCUS - but again, remember the focus is for _this current piece of information_ and is not necessarily related to other currently arriving information or facts.

Well what the heck does this have to do with B.I.?
You should already be able to see it... In a VLDW where we have huge stores of time based information it is near impossible (without focus) to find what you're looking for, so the first problem is (again) establishing focus - where oh where does my data FIT? So if you're processing in REAL-TIME folks, listen up... Once we establish which data sets are affected, we need to understand IN A FRACTION OF A SECOND how to change the "known outcome" on the existing history - oh yes, and by the way, this all has to happen in PARALLEL with all the other arriving facts, or it simply won't be executed in a timely fashion.

Now what else am I saying about ALL THIS DATA we've stored?
HERE IT IS:

* Large volumes of data must be processed and learnt from.
* The combined "learned" knowledge (we'll call it a derivation on average) of a STACK of related information within a topic area IS MORE IMPORTANT than the parts, or than all the history and individual facts, but without all the details, we can't create a combined image.
* This combined knowledge element must be used IN CONTEXT or AS A CONTEXT LENS to quickly establish the relevance of the incoming information, and how it will affect the "next" view or look at the information.

In other words:
* VLDB / VLDW data by itself is important when you're digging for detailed specifics that happened at a specific point in time, but the real value is having a "mined" collective perspective on all that detail that allows us to establish where and how our current "transaction" will affect the outcome.

A 24x7x365 neural network / data mining engine MUST be up and running consistently. It must first be trained, and then constantly adjusted for "drift" off topic, but the neural net should be receiving the transaction inflow for "context" application in order to establish our focus, or put a "lens" of information to our historical data set. This isn't your father's neural net, and not your mother's data mining engine - no... this is a different way of "scoring" parts of interesting history that are within the interested perception bounds (Jeff's term) so that processing of "extraneous noise" is filtered away as one of the first steps.

This data "mining" engine or neural net is highly focused, real-time processing based on transactions, and it houses "the many different lenses" of focus (or combined derivations) of different but interesting views of history, so that based on the incoming transaction - it can change the "lens" to match and see where the impact is.
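
As a rough sketch of "changing the lens to match the incoming transaction", assume each lens is a precomputed set of feature weights derived offline from history, and each transaction is a set of features. This is a deliberately simple weighted-overlap score, not a neural network; `Lens`, `pick_lens`, and the `noise_floor` threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Lens:
    """A precomputed summary ("combined derivation") of one slice of history."""
    name: str
    weights: dict  # feature -> importance, assumed to be learned offline

    def score(self, txn):
        # Sum the weights of the features this transaction actually carries.
        return sum(w for feature, w in self.weights.items() if feature in txn)


def pick_lens(lenses, txn, noise_floor=1.0):
    """Return the best-scoring lens, or None if the transaction is noise."""
    best = max(lenses, key=lambda lens: lens.score(txn))
    return best if best.score(txn) >= noise_floor else None


# Hypothetical lenses and feature names.
lenses = [
    Lens("fraud", {"wire_transfer": 2.0, "new_account": 1.5}),
    Lens("churn", {"complaint": 2.0, "downgrade": 1.0}),
]
```

For example, `pick_lens(lenses, {"wire_transfer", "new_account"})` selects the "fraud" lens, while a transaction that matches no lens above the noise floor returns `None` and is filtered away before any history is touched.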

From a B.I. perspective, I'm also saying that the sum of the whole may be more interesting and more valuable than the sum of the parts, but to get the sum of the whole, we have to have all the parts when we start. So the INTELLIGENT part of Business Intelligence is all about
1) Knowing which patterns are most interesting / most costly to the business - establishing the RIGHT LENS at the right time, and having that lens available ahead of the arrival of the transactions.
2) Understanding that changing the color of the lens is easy when the transaction arrives, but that over time, the "lens" needs to be replaced (due to virtual scratching / shifting of the answer set), and needs to be re-aligned with the large set of facts included in the history.
3) Real-time transaction processing IS 100% necessary in a VLDW / data warehousing environment.
4) ALL the facts that we collect are important, depending on the "viewing perspective" of the business user.

New kinds of systems like this are in development labs, and I can help you with your efforts (should you so desire) to focus the lens. But it's advances in technology beyond what we have today that make this so interesting.

Food for thought anyhow, I'd love to hear what you have to say.

Cheers,
Dan L
DanL@DanLinstedt.com


Posted October 9, 2007 7:33 AM

Interesting thoughts abound around the issues of determining context for a finite state model. If you've ever considered the metadata stored within a "data model" then you may be partial to the discussion here today. What I've got is an engine that starts with structural definitions. These definitions exist in a finite state; sometimes they openly declare hierarchies, other times they hide the associations and relationships in the naming conventions. However, one truism must hold: if a human cannot make "sense" of the structure, then it must be unusable. Therefore it is my belief that in most cases there is a finite state of context which can be automatically determined from the structure through a series of mathematical, statistical, and ontological approaches.

These techniques are necessary in discovering the potential of unstructured data sets within DW2.0.

Ok, what's all this geek-speak?
What I'm saying is this: Data Models are constructed from a fundamental need to "house/organize/and store" information in mass-quantities. Data Models (much like file-folders) have levels of grain both within, and across multiple structures. The elements within a data model are typically stated according to some "finite" naming convention. Things like abbreviations, annotations, squashed / shortened definitions, relationships (referential integrity) and so on determine partial definition of the element.

In other words, they can _assist_ in placing the entire field / attribute into a hierarchical taxonomy. And further, can place the entity (table) into an over-arching ontology. So why do we spend all this time deciphering data models to "understand how our source systems store information" when we should be spending time deciding how to apply that information to a better more useful context?

Isn't there some mathematical way to represent the language of attributes and entities to which we can tie common meanings and definitions?
Turns out there is... free sources (like ontology engines such as WordNet) can help with language definition and raw metadata understanding from around the world. I believe that by tying this information together with a formulated understanding of a given data model that we can begin to understand "what we really represent" inside our capture systems.

What does this mean?
Well, what I'm referring to is the science of taking "what is" described, making sense of the context through ontologies and taxonomies, then applying those definitions against "what the business thinks is happening". This produces a significant GAP analysis. It can also be bounced against what data is "stored" in the system, to check whether or not the information matches up with the basic goals of the taxonomy.

In other words, there is inherent meaning to the design (even if the design is encrypted) that makes sense or provides context to the data _at the grain at which it stands_ within the data model. For example, in English we read left to right. Typically things on the "left side" carry more importance than things on the "right side". In the case of formulas, things on the "left side" are where the final answers are put, while things on the "right side" define how we get there (computation).

Ok, so we have a field name:

CAP_EXP_TOTAL

What can we glean from this?
If we assume that the ontology for this definition is in the line of finance, we might end-up expanding the abbreviations to:
CAPITAL
EXPENDITURE
TOTAL

Each abbreviation has a subsequent meaning. The "total" component may be a definition of a function, and context of the data might only be derived when looking at the taxonomy (table name) that encapsulates this information. If the table is a SALES table, then that might be computed one way, if the table is a FINANCIAL table then that total may be computed differently. We can prove this fact by profiling the information housed within the fields.
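
A minimal sketch of that expansion step, assuming the ontology has already been chosen and that each ontology comes with its own abbreviation dictionary. The dictionaries below are made up for illustration, and `expand` is a hypothetical helper.

```python
# Hypothetical abbreviation dictionaries, one per ontology (domain).
ABBREVIATIONS = {
    "finance": {"CAP": "CAPITAL", "EXP": "EXPENDITURE", "ACCT": "ACCOUNT"},
    "logistics": {"CAP": "CAPACITY", "EXP": "EXPORT"},
}


def expand(field_name, ontology):
    """Expand each underscore-separated token using the chosen ontology.

    Tokens with no entry in the dictionary are returned unchanged.
    """
    table = ABBREVIATIONS[ontology]
    return [table.get(token, token) for token in field_name.upper().split("_")]
```

With these dictionaries, `expand("CAP_EXP_TOTAL", "finance")` yields CAPITAL, EXPENDITURE, TOTAL, while the same field name under the made-up "logistics" ontology expands differently. That is why the enclosing table, and profiling of the data itself, matter for settling the context.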

I digress... What I'm discussing here is the notion of gaining partial context from a semantic ontology layer, partially built from the model and partially backed by "lookup" of definitions, where each individual word might itself be housed in a larger ontology (from WordNet, for example).

Now, if I had a field called:
TYP_CD

We might not be able to discern what "type" of "type code" this thing is without looking at the enclosing entity. On the other hand, TYP_CD might actually be abbreviated this way:

HULL_TYP_CD
TAIL_TYP_CD
CUST_TYP_CD
or even:
ACCT_TYPE_CODE

Each of these abbreviations shares the general context of "type code". The fact that it is a "type code" is less important than what "type" of code the data is, and what grain the data lives in. An abbreviation such as "TYP_CD" can then be generalized to a shared ontology, regardless of industry-specific model. It can also be represented by a finite set of definitions like "TYPE_CODE, TYP_CODE, TP_CD, TYPE_CD" and so on. The shorter the abbreviations, the lower an engine's confidence in determining actual matches without looking at the data.
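
Here is a small sketch of that generalization, assuming a finite list of known variants for "TYPE" and "CODE". The variant sets and the confidence formula (the fraction of the letters of "TYPECODE" that the abbreviation keeps) are assumptions made for illustration, not a measured heuristic.

```python
# Hypothetical variant lists; a real engine would learn or look these up.
TYPE_VARIANTS = {"TYPE", "TYP", "TP"}
CODE_VARIANTS = {"CODE", "CD"}


def parse_type_code(field_name):
    """Split a field name into (qualifier, canonical form, confidence).

    Returns None when the name does not end in a known type-code variant.
    Shorter abbreviations keep fewer letters and so score lower.
    """
    tokens = field_name.upper().split("_")
    if len(tokens) < 2:
        return None
    typ, code = tokens[-2], tokens[-1]
    if typ not in TYPE_VARIANTS or code not in CODE_VARIANTS:
        return None
    qualifier = "_".join(tokens[:-2]) or None  # e.g. HULL, TAIL, CUST
    confidence = (len(typ) + len(code)) / len("TYPECODE")
    return qualifier, "TYPE_CODE", confidence
```

The qualifier (HULL, TAIL, CUST, ACCT) carries the part that is specific to the enclosing entity, while the canonical "TYPE_CODE" is the part that can be shared across models. A bare `TYP_CD` parses with no qualifier at all, which is exactly the case where you have to look at the enclosing entity.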

Ok, so what does all this mean?
This means we can automatically examine the data, the model, the ontologies, and glean or construct: partial meaning, grain, global or local (shared definitions/specific definitions), overlap, and finally: we can mathematically optimize the STRUCTURAL MODEL we have in order to achieve a better result, a more common result where the metadata of the business shifts to the "applied function of the data" at run-time.

We can achieve partial deterministic finite context for the base definition / storage of the data, and the actual congruence of the data set across multiple data models. It means we can provide 60% to 80% commonality across data sets and data models around the world, and it means potential standardization of grain and semantic layers.
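
One simple way to put a number on "commonality" between two models is to canonicalize the field names and then measure how much the two sets overlap. The sketch below uses Jaccard overlap, which is my choice of measure for illustration; the synonym table and both toy models are hypothetical.

```python
def canonical(field_name, synonyms):
    """Rewrite each token to its canonical form; unknown tokens pass through."""
    return "_".join(synonyms.get(t, t) for t in field_name.upper().split("_"))


def commonality(model_a, model_b, synonyms):
    """Jaccard overlap of two models' field names after canonicalization."""
    a = {canonical(f, synonyms) for f in model_a}
    b = {canonical(f, synonyms) for f in model_b}
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical synonym table and two toy models.
SYNONYMS = {"TYP": "TYPE", "CD": "CODE", "CUST": "CUSTOMER", "NM": "NAME"}
model_a = ["CUST_TYP_CD", "CUST_NM", "HULL_TYP_CD"]
model_b = ["CUSTOMER_TYPE_CODE", "CUSTOMER_NAME", "ACCT_TYPE_CODE"]
```

Compared on raw names these two toy models share nothing. After canonicalization, two of the four distinct fields line up. The point of the sketch is only that the overlap becomes measurable once names are lifted to a shared vocabulary.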

If you have a sample set of a data model you'd like me to analyze in this part of the blog, I'd be happy to give you my thoughts on the break-down and semantic meanings. We'll see how well I do without knowledge of your business.

Cheers,
Daniel Linstedt


Posted September 2, 2007 11:34 AM

Every once in a great while I have a small idea bubbling around in my head. These ideas are things that seem like interesting points of view and make themselves known either in the twilight hours, or just before I'm about to fall asleep. Anyhow, they are interesting ideas, and I am choosing to share them as they occur. They could be anything from observations to idiosyncratic comments; some may be useful while others will (hopefully) be forgotten. I ask you now: if you like an idea, spur the conversation by offering a comment or two. Who knows? Maybe, just maybe it will be good....

Let's engage the warp drive past the horizon.


Posted August 14, 2007 10:26 PM