Blog: Dan E. Linstedt

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, unstructured data, DW and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University, where I serve on an academic advisory board for Master's students in I.T. I can't wait to hear from you in the comments of my blog entries. Thank-you, and all the best; Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor for The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI/CMMI Level 5, and is the inventor of the Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, trained hundreds of industry professionals, and is the voice of Bill Inmon's blog at http://www.b-eye-network.com/blogs/linstedt/.

December 2007 Archives

I've recently begun research into this area, and am calling it "Automorphic data models" rather than dynamic data warehousing, because I think the concept lends itself better to this kind of term. Dynamic Data Warehousing seems to be an overused and slightly abused term in the industry, and it raises quite a few questions as to how, and what, it is. Vendors are also using the term to mean different things. We'll let the business and the vendors work out their definition of it over the next few years. I'm going to write exclusively (for a while, in this section) on Automorphic Data Modeling. These entries are aimed at the researchers and the scientifically minded people in the audience.

First, I must apologize to all those who _really_ know this stuff. I am an architect and an information modeler at heart. I believe these connections exist in the data model architecture I wrote up, called the Data Vault Model, because it is based in spatial-temporal mathematics, and because it is based on a "poor man's definition of how the brain MIGHT store/use/retrieve information." Based on these hypotheses, I can see where the mathematics of these types apply to the model. I'd love to hear from those of you who can explain why these theories will or won't work; it will be interesting to see how this progresses.

If we start with Webster's definition of Automorphic, we end up with the following:

Patterned after one's self.

The conception which any one frames of another's mind is more or less after the pattern of his own mind, -- is automorphic. --H. Spencer.
http://dictionary.reference.com/browse/automorphic

However, I prefer the mathematical definition of Automorphism:

In mathematics, an automorphism is an isomorphism from a mathematical object to itself. It is, in some sense, symmetry of the object, and a way of mapping the object to itself while preserving all of its structure. The set of all automorphisms of an object forms a group, called the automorphism group. It is, loosely speaking, the symmetry group of the object. http://encyclopedia.thefreedictionary.com/automorphism
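To make the quoted definition concrete, here is a tiny worked example of my own (it has nothing to do with data warehousing yet): negation on the integers modulo 7, under addition, maps the group onto itself and preserves the addition structure, which is exactly what an automorphism is.

```python
# Toy illustration (my own example, not from the quoted source): negation on the
# integers modulo 7, under addition, is an automorphism -- a structure-preserving
# map from the group onto itself.

n = 7
elements = list(range(n))

def phi(x):
    """Candidate automorphism: x -> -x (mod n)."""
    return (-x) % n

# Check the two defining properties: phi is a bijection on the set, and it
# preserves the group operation, i.e. phi(a + b) == phi(a) + phi(b) (mod n).
is_bijection = sorted(phi(x) for x in elements) == elements
preserves_op = all(phi((a + b) % n) == (phi(a) + phi(b)) % n
                   for a in elements for b in elements)

print("bijection:", is_bijection)             # True
print("structure preserved:", preserves_op)   # True
```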

Automorphic forms (which is what I'd suggest the Data Vault model is built from):

In mathematics, the general notion of automorphic form is the extension to analytic functions, perhaps of several complex variables, of the theory of modular forms. It is in terms of a Lie group G, to generalize the groups SL2(R) or PSL2(R) of modular forms, and a discrete group Γ, to generalize the modular group, or one of its congruence subgroups. The formulation requires the general notion of factor of automorphy j for Γ, which is a type of 1-cocycle in the language of group cohomology. The values of j may be complex numbers, or in fact complex square matrices, corresponding to the possibility of vector-valued automorphic forms. The cocycle condition imposed on the factor of automorphy is something that can be routinely checked, when j is derived from a Jacobian matrix, by means of the chain rule. http://encyclopedia.thefreedictionary.com/Automorphic+form


Essentially, what we are doing within the Data Vault data model is a form of automorphism. The Data Vault Model is actually based on many different components of temporal mathematics and spatial mathematics. (I've listed a few of the research papers I used in the 1990s to help me construct the structural integrity of the Data Vault):

1. “Unifying Temporal Data Models via a Conceptual Model”, http://www.cs.arizona.edu/~rts/pubs/ISDec94.pdf
2. “Notions of Upward Compatibility of Temporal Query Languages”, http://www.cs.arizona.edu/~rts/pubs/Wirtschafts.pdf
3. “Temporal Data Management”, http://oldwww.cs.aau.dk/research/DP/tdb/TimeCenter/TimeCenterPublications/TR-17.pdf
4. “Spatio-Temporal Data Types: An Approach to Modeling and Querying”, http://web.engr.oregonstate.edu/~erwig/papers/MovingObjects_GEOINF99.pdf
5. “Formal Semantics for Time in Databases”, http://portal.acm.org/citation.cfm?id=319986&coll=portal&dl=ACM&CFID=6511873&CFTOKEN=58729889

The Data Vault model is capable of adapting and changing on the fly, exhibiting the mathematical properties of automorphism: through architecture mining combined with data mining efforts we can "learn" what architectural flaws exist, where stronger relationships exist, and where the architecture can change itself or re-connect to itself to form a stronger data model.

How does this work?
The Data Vault LINK is made up of vectors. It houses a directional connection to each HUB that it is associated with. The vector of that connection can be assigned a strength and confidence coefficient to determine its usefulness within the data set contained within a link. Mining the data over time can produce a powerful combination of patterns of change, along with the discovery of patterns of association (possibly never known before, or arising from a pre-known state).
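As a rough sketch of what I mean (the class and attribute names below are purely illustrative, not part of any Data Vault standard), a Link can be thought of as a set of directed connections to Hubs, each carrying a strength and a confidence coefficient that a later mining pass can update:

```python
from dataclasses import dataclass, field

# Illustrative sketch only -- the names here are my own, not a Data Vault standard.

@dataclass
class HubVector:
    """A directional connection from a Link to one Hub."""
    hub_name: str
    strength: float = 0.0    # how strongly the association shows up in the data
    confidence: float = 0.0  # how much we trust that measurement

@dataclass
class Link:
    """A Data Vault Link viewed as a collection of Hub vectors."""
    name: str
    vectors: list = field(default_factory=list)

    def add_hub(self, hub_name: str) -> None:
        self.vectors.append(HubVector(hub_name))

# Example: a Link associating customers, products, and stores.
sale = Link("LNK_SALE")
for hub in ("HUB_CUSTOMER", "HUB_PRODUCT", "HUB_STORE"):
    sale.add_hub(hub)

print([v.hub_name for v in sale.vectors])
```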

The data mining tool can then be taught either "what to look for," or it can be set off in discovery mode to associate information based on a Data Vault model already constructed (using the existing Data Vault model as a starting point for the learning process), and then determine whether any "undiscovered" relationships exist. Furthermore, the process of mining the data can then be used to assign strength and confidence coefficients to EACH of the vectors in each link, thus preparing for the architectural mining phase.
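For instance (a deliberately naive sketch of one possible scoring scheme, not a prescribed algorithm), the mining step could derive a strength coefficient from how frequently a pair of business keys co-occurs in the source data, and a confidence coefficient from how much evidence supports that count:

```python
from collections import Counter

# Naive sketch of one possible scoring scheme -- not a prescribed algorithm.
# Each observation is a (customer_key, product_key) pair seen in the source data.
observations = [
    ("C1", "P9"), ("C1", "P9"), ("C1", "P3"),
    ("C2", "P9"), ("C2", "P9"), ("C2", "P9"),
]

pair_counts = Counter(observations)
total = len(observations)

for (customer, product), count in pair_counts.items():
    strength = count / total           # share of all observations
    confidence = min(1.0, count / 3)   # arbitrary rule: 3+ sightings = full confidence
    print(f"{customer} -> {product}: strength={strength:.2f}, "
          f"confidence={confidence:.2f}")
```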

So how is the Data Vault automorphic?
The Data Vault is connected to itself (within itself) via the Links and the vectors within the links. Each vector can be considered a component within the mathematical matrix of the automorphic functions. The mathematics of "groups" and vector analysis can then be applied to dynamically alter the matrix for a potentially different outcome.
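In other words (again a simplified sketch of my own, with made-up Hub and Link names), the strength coefficients can be laid out as a matrix with one row per Link and one column per Hub, and a transformation over that matrix can propose a re-weighted, re-connected version of the same model:

```python
# Simplified sketch (my own): lay the strength coefficients out as a matrix,
# one row per Link and one column per Hub, then let a transformation over the
# matrix propose a re-weighted version of the same model.

hubs = ["HUB_CUSTOMER", "HUB_PRODUCT", "HUB_STORE"]
links = ["LNK_SALE", "LNK_RETURN"]

# strengths[i][j] = strength of the vector from links[i] to hubs[j]
strengths = [
    [0.90, 0.85, 0.10],   # LNK_SALE
    [0.40, 0.05, 0.60],   # LNK_RETURN
]

# One trivially simple "alteration": zero out vectors below a threshold,
# i.e. propose that those weak connections be dropped from the model.
THRESHOLD = 0.20
proposed = [[s if s >= THRESHOLD else 0.0 for s in row] for row in strengths]

for link, row in zip(links, proposed):
    kept = [h for h, s in zip(hubs, row) if s > 0.0]
    print(f"{link}: keep connections to {kept}")
```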

Thus, new links can be constructed on the fly, tested, and then removed (if no real value is perceived by the human on the other end of the computer). Likewise, links can be constructed and old linkages removed to produce an auto-morphing data structure, something akin to self-correction. I will NOT go so far as to say it's actually learning, because it (the computer) will still not understand the CONTEXT in which it's applying the changes.
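A minimal sketch of that propose/test/prune cycle might look like this (the link names and the 0.5 cutoff are placeholders; the important point is that the value judgement stays with a person):

```python
# Sketch of the propose / test / prune cycle described above (names are mine).
# A candidate link survives only if a human reviewer decides it has value.

candidate_links = {
    "LNK_CUSTOMER_STORE": 0.72,     # discovered association and its strength
    "LNK_PRODUCT_SUPPLIER": 0.15,
}

def reviewer_finds_value(link_name: str, strength: float) -> bool:
    """Stand-in for the human judgement step -- the computer cannot supply
    the business CONTEXT, so this decision stays with the operator."""
    return strength >= 0.5  # placeholder rule for the sketch

surviving = {name: s for name, s in candidate_links.items()
             if reviewer_finds_value(name, s)}

print("kept:", sorted(surviving))
print("pruned:", sorted(n for n in candidate_links if n not in surviving))
```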

This type of system STILL requires guidance, training, and tweaking from the operators in order to achieve the desired outcomes and modifications to the model that make sense to the business, whether that business is commercial or government oriented.

However, this type of system can be applied (easily) directly to the Data Vault modeling constructs in order to achieve a self-changing data store, something that appears to "point out" different facts, or discover different unknown relationships, without understanding what it has. The understanding part is still up to the human.

Ok, so how does this benefit business?
Well, if we can spot relationship changes automatically (on a simplistic level), or mathematically identify POSSIBLE inefficiencies in our business, then we might be able to adapt our business based on the information being collected (or, in some cases, adapt the source systems to collect information WE MISSED that might be vital to our operations). The data sets and the architecture of the data sets can tell us as much about our business as the processes and the business models we use.

You can find out more about the Data Vault model (for free) at: http://www.DanLinstedt.com

Hope this is interesting,
Daniel Linstedt


Posted December 23, 2007 6:32 AM
Permalink | No Comments |

There are problems in I.T. today with a lack of agility. There are issues with business creating its own spread-marts in MS-Access, Excel, and OLAP cubes. There is a widening gap between the "corporate Enterprise Data Warehouse" and what I.T. can provide, how quickly it can adapt, and how cost-effective it can be going forward. There is a rise of something called 2nd generation data warehouses... Why? Because 1st generation warehouses are suffering from "stove-piped solutions" re-created by using the wrong modeling techniques for the data warehouse. Bill Inmon has been writing lately about data modeling and how to do it properly. In this entry I'll dive head first into these issues, what's going on in the industry, and what you can do about it.

What are you talking about, "Willis"?
(From the American '80s TV show called Diff'rent Strokes.)
I've seen it happening firsthand at many, many clients. The typical story is as follows:

First, the company selects "star-schema" modeling as the way to build its enterprise data warehouse. Then it selects conformed dimensions and shared fact tables. The first implementation costs the business 90 days, maybe 5 consultants, and maybe $250,000 USD. If you're lucky, it might be $150,000.

The business unit this is built for becomes very happy, with quick delivery, apparently low cost, and super-fast access to dimensional information that meets its business needs... But then reality sets in... Other business units see this success and want "one of their own" built.

There's trouble on the horizon, sailor...
What do you mean? I don't see any trouble... Well, to tell you the truth, building a second or even a third "star schema" and then federating them together doesn't seem to be such a big deal. The cost may increase only slightly, to maybe $180k or $275k, and the number of days to implement may increase only slightly, to maybe 110-120 days. But what's happening here?

The reality of it is: I.T. (because of business needs) takes existing dimensions and begins to add different and loosely affiliated information to the same "dimension," thus apparently attempting to "conform" it.

So what's the bottom line?
As this process continues and I.T. gets into the 5th or 6th "project," the conformity of the dimensions becomes lost in the fray. Too many different kinds of data are added to the dimension "to conform it to the enterprise," which distorts its original purpose and in fact (if done improperly) can destroy the grain of the business key and the fact tables to which the dimension is "hooked up." More importantly, each time I.T. increases the size of this monster, the cost creeps higher and the implementation time-frames grow longer.

I.T. becomes less agile in its implementation strategies, and a simple "change" the business has to make (one that used to cost $150k and take 90 days) now costs well into the $350k range and takes 6 months or more. What was a conformed dimension now becomes a "deformed" dimension and has trouble meeting the business needs.

What's the business impact of all of this?
The business begins to wonder how effective the "EDW" solution is... They need the changes made in order to stay effective in business, and since I.T. costs too much and takes too long, they make a copy of existing data sets and build their own MS-Access databases and Excel spreadsheets. The flip side is that I.T. mitigates this by beginning to construct singular star schemas (back to quick delivery and smaller cost), which means I.T. is reconstructing the very stovepipes it was trying to eliminate in the first place!

You have one of two outcomes:
1) Wow! This is HUGE, I never realized it - but this is exactly what's happening in my business... (You are now teetering on the brink of disaster unless you move to DW2.0 and 2nd generation warehousing.)
or
2) I.T. has followed the strictest of standards all along, and the system hasn't really "grown" beyond maintenance control costs because the data modelers did their jobs properly and volume, real-time, and compliance aren't issues here... so I see no reason to look for solutions. Congratulations, you're one of the few, and I'd love to hear from you about how you achieved your success.

How can we solve this problem?
There are several solutions to these problems, but they all stem from choosing the right data modeling architecture for your EDW, along with a solid foundational architecture and framework with which to build your system.

The short answer is to look at becoming DW2.0 compliant. DW2.0 brings with it fundamental tenets we should adhere to in order to put the right architectural components in place. It also comes with the standard definitions the industry has lacked over the years; finally, at last, we have standards, definitions, and frameworks to follow. The second part of the short answer is that you need the right data model under the covers to make scalability, flexibility, and compliance a top-notch effort.

There are several different data modeling ideas and techniques floating around out there (all of them built off others' ideas). They can help you overcome the pains you may be feeling.

The first (and my favorite - but then again, I'm biased) is called the Data Vault; its real name is Common Foundational Integration Model Architecture (but that doesn't sell). It represents 10+ years of R&D, from 1990 to 2000, it has been available for free since 2000, and last summer it was endorsed as follows:

"“The Data Vault is the optimal choice for modeling the EDW in the DW 2.0 framework.”, June 2007, Bill Inmon

It's currently in use in many BIG industries around the world, and has a community for its seekers.

Data Vault "customers" (I use the term loosely because they pay for consulting, not for the modeling technique which is free), have repeatedly told me how it really helps make I.T. agile again, how it's made I.T. stars in the business eyes because it has lowered cost, and reduced time frames to delivery, and gotten I.T. back in to the EDW game of providing value to the business by keeping up with the changes. The Data Vault has proven itself to be the technique of 2nd Generation EDW efforts around the world.

http://www.DanLinstedt.com

Many customers (after following the whole Data Vault approach) can actually produce star schemas for the business (as an output/delivery mechanism) in about a 24-hour turnaround, from the time they receive their business requirements to the time full delivery occurs. If you're lucky, prototypes with data can be turned around in less than an hour.
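To give a flavour of why that turnaround is possible (a deliberately tiny sketch, with made-up table and column names rather than anyone's production metadata), a dimension view can be generated mechanically from Hub and Satellite metadata:

```python
# Tiny sketch with made-up table and column names: generating a dimension view
# directly from Hub/Satellite metadata, which is why the turnaround from
# requirements to delivery can be so short.

hub = {"table": "HUB_CUSTOMER", "key": "CUSTOMER_HK", "business_key": "CUSTOMER_ID"}
satellite = {"table": "SAT_CUSTOMER_DETAILS",
             "columns": ["CUSTOMER_NAME", "CUSTOMER_REGION"]}

select_list = ",\n    ".join(
    [f"h.{hub['business_key']}"] +
    [f"s.{col}" for col in satellite["columns"]]
)

view_sql = (
    f"CREATE VIEW DIM_CUSTOMER AS\n"
    f"SELECT\n    {select_list}\n"
    f"FROM {hub['table']} h\n"
    f"JOIN {satellite['table']} s ON s.{hub['key']} = h.{hub['key']};"
)

print(view_sql)
```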

There's another technique (which, if you take the Data Vault model to something like sixth normal form, looks very similar) called Anchor Modeling, from a Swedish company called Intellibis. The creators say it's been around since 2002; I cannot yet find any customer stories, or reasons to normalize to that level, but it appears to hold promise for the future. It's a different way of thinking.

http://www.intellibis.se/

And finally, an interesting thought pattern that has yet to be put into practice (as far as I can tell) is something called the Triadic Continuum. There is a book written on the subject that approaches the theoretical (and some concrete) aspects of what this means. However, it appears to be presented as a way to help machines understand context - I'm not sure exactly (yet) how this applies to EDW.

http://jeffjonas.typepad.com/jeff_jonas/2007/11/the-triadic-con.html
http://www.beyeblogs.com/sharpeningstones/archive/2007/12/triadic_continuum_prolog.php
http://www.eruditor.com/books/item/9780595441129.html.en (the book)

You'll hear more from me in this upcoming series on data modeling and architecture (and how it affects your business) as we move forward. In the meantime, check out the modeling efforts noted above, or respond to this entry and tell us what you are engaged in, how successful it is, and where some of the pains you are experiencing exist.

Thank-you,
Daniel Linstedt


Posted December 21, 2007 3:59 AM
Permalink | 2 Comments |