
Blog: Dan E. Linstedt

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, unstructured data, DW, and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University, where I participate on an academic advisory board for Masters students in I.T. I can't wait to hear from you in the comments of my blog entries. Thank you, and all the best, Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author >

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor at The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI/CMMI Level 5, and is the inventor of the Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, trained hundreds of industry professionals, and is the voice of Bill Inmon's blog at http://www.b-eye-network.com/blogs/linstedt/.

Recently in Dynamic Data Warehousing Category

Welcome to the next installment of Data Vault Modeling and Methodology.  In this entry I will attempt to address the comment I received on the last entry regarding the Data Vault and master data.  I will continue posting as much information as I can to help spread the knowledge for those of you still questioning and considering the Data Vault.  I will also try to share more success stories as we go, as much of my industry knowledge has been accrued in the field - actually building systems that have turned into successes over the years.

Ok, let's discuss the health-care provider space, conceptually managed data and master data sets, and a few other things along the way.

I have a great deal of experience in building Data Vaults to assist in managing health-care solutions.  I helped build a solution at Blue Cross Blue Shield (WellPoint, St. Louis); another Data Vault was built and used for part of the Centers for Medicare and Medicaid Services in Washington, DC.  Another Data Vault is currently being built for the Congressionally mandated US government health-care electronic records system for helping track US service personnel, and there are quite a few more in this space that I cannot mention or discuss.

Anyhow, what's this got to do with Data Vault Modeling and Building Data Warehouses for chaotic systems, or immature organizations?

Well - let's see if we can cover this for you.  First, realize that we are discussing the Data Vault data modeling constructs (hub and spoke) here; we are not addressing the methodology components - that can come later if you like (although, having said that, I will introduce the parts of the project that help parallel yet independent team efforts meet or link together at the end).

Ok, so how does Data Vault Modeling truly work?

It starts with the business key - or should I say the multiple business keys.  The business keys are the true identifiers of the information that lives and breathes at the fingertips of our applications.  These keys are what the business users apply to locate records and to uniquely identify records across multiple systems.  There are plenty of keys to go around, and source systems often disagree as to what the keys mean, how they are entered, how they are used, and even what they represent.  You can have keys that look the same but represent two different individuals; you can have two of the same key that SHOULD represent the same individual but whose details (for whatever reason) differ in each operational system; or you can have two of the same key representing duplicate records across multiple systems (the best-case scenario).

These business keys are the HUBS, or HUB entities.  If you want, the different project teams can build their own Data Vault models constructed from the business keys in their own systems.  Once the data is loaded to a historical store (or multiple stores), you can then build Links across the business keys to represent "same-as" keys: i.e., keys that look the same, that the business user defines to be the same, but where the data disagrees with itself.
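Here's a minimal sketch in Python of the idea - the class names, business keys, and record sources are made up for illustration, not a standard - showing how a "same-as" Link associates two look-alike business keys without touching either source's Hub:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Hub:
    business_key: str      # the business key exactly as the source system knows it
    record_source: str     # which system the key was loaded from
    load_date: datetime

@dataclass(frozen=True)
class Link:
    hub_keys: tuple        # two or more business keys being associated
    record_source: str     # who asserted the association
    load_date: datetime

now = datetime(2010, 3, 15)
member_a = Hub("M-1001", "claims_system", now)
member_b = Hub("M1001", "billing_system", now)

# A "same-as" Link asserting that the two look-alike keys refer to one
# individual.  Neither Hub row is altered; the assertion lives in the Link.
same_as = Link((member_a.business_key, member_b.business_key), "mdm_review", now)
```

Because the Hubs are immutable records, a later change of opinion is expressed by retiring this Link and loading a new one, not by rewriting history.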

Remember, Links are transitory; they represent the state of the current business relationships today.  They change over time - Links come and go.  They are the fluid, dynamic portion of the Data Vault, making "changes to structure" a real possibility... but I digress.

The first step is to get the different teams building their own Data Vaults in parallel.  Once they have built Hubs, Links, and their own Satellites - and they are loading and storing historical data - then a "master data" team can begin to attack the cross-Links from a corporate standpoint.  This must be done in much the same manner as building a corporate ontology: different definitions for different parts of the organization, even for different levels within the organization.  The master data team can build the cross-Links to provide the "corporate view" to the corporate customers, with the appropriate corporate definitions.

Think back to a scale-free architecture: it's often built like a B+ tree or binary tree, where nodes are inside of nodes, other nodes are stacked on top of nodes, and so on.  So we have Data Vault Warehouse A and Data Vault Warehouse B - now we need Corporate Data Vault Warehouse C to span the two.  Links are the secret, followed by Satellites on the Links.  There may (as a result of a spreadsheet or two used at the corporate levels) even be some newly added Hub keys - again, business keys used at the corporate level that are not used at any other level of the organization.
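A minimal sketch of that layering - the keys and structures here are entirely hypothetical - where corporate vault C owns only the cross-Links and the Satellites on those Links, while the Hubs stay in the lower vaults where they were built:

```python
# Two independently built vaults, reduced here to their sets of business keys:
vault_a_hubs = {"CUST-42", "CUST-43"}     # loaded by team A
vault_b_hubs = {"ACCT-9"}                 # loaded by team B

# Corporate vault C adds nothing to A or B; it records only the associations,
# plus a Satellite on each Link carrying the corporate-level definition.
corporate_links = {("CUST-42", "ACCT-9")}
link_satellites = {
    ("CUST-42", "ACCT-9"): {"definition": "customer holds account",
                            "effective": "2010-03-15"},
}

for left, right in corporate_links:
    # every cross-Link must resolve to keys already loaded in the lower vaults
    assert left in vault_a_hubs and right in vault_b_hubs
```

The design choice this illustrates: the corporate view is additive, so teams A and B never block on the master data team, and retiring a corporate Link never touches their history.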

Finally, at long last, a good use for marrying ontologies to enterprise data warehouses.  By the way, this is also the manner in which you develop a master data set.  Don't forget that MDM means Master Data Management - and MDM includes people, process, and technology.  The Data Vault only provides the means to easily construct master data; it is NOT an MDM solution, strictly an MD "Master Data" solution.

Governance doesn't have to be separate, doesn't have to come before or after the Data Vaults are built - and again, disparate EDW Data Vaults can be built by parallel teaming efforts.  However - that said, once you embark on building Master Data sets, you *MUST* have governance in place to define the Ontology, the access paths, the corporate view (corporate Links & Hubs & Satellites) that you want in the Master Data components.

In essence you are using the Data Vault componentry (from the data modeling side) to bridge the lower level Data Vaults - to feed back to operational systems (that's where Master Data begins to hit ROI if done properly), and to provide feeds to corporate data marts.

In fact, we are using this very same technique in a certain organization to protect specific data (keeping some data in the classified world while other data lives in the non-classified or commercial world).  Scale-free architecture works in many ways, and the LINK table (aside from adding joins) is what makes this possible and what makes the Data Vault Model fluid.

It's also what helps IT be more agile and more responsive to business needs going forward.  The Link table houses the dynamic ability to adapt quickly and change on the fly.

I'm not sure if I mentioned it, but ING Real Estate is using Excel spreadsheets through Microsoft SharePoint to trigger Link changes and structural changes to the Data Vault on the fly.  Thus when the spreadsheets change and the relationships change, the Link tables change - leaving the existing history intact, and creating new joins/new Links for future historical collection.  This is yet another example of dynamic alteration of structure (on the fly) that is helping companies overcome many obstacles.

But I ramble.  There's another company, Tyson Foods, which has a very small Data Vault absorbing XML and XSD information from 50 or so external feeds, most of which change on a bi-weekly basis.  They had one team build this as a pilot project using the Data Vault, and they are now adapting easily and quickly to any of the external feed changes coming their way.  In fact, they were able to apply the "master data/governance" concepts at the data level and "clean up" the XML quality of the feeds they were redistributing back to their suppliers.

So let me bring it home: are governance and clean-up required up-front to build a Data Vault?

No - not now, not ever.   Is it a good thing? Well, maybe.  But by loading the data you do have into disparate Data Vaults, you can quickly and easily discover just where the business rules are broken, and where the applications don't synchronize when they are supposed to.  Can the Data Vault Model help you in building your MDM?  Yes, but it's only a step on the master data side of the house.  You are still responsible for the "Data Management" part of MDM - the people, process, and technology, including the governance...  all part of project management at a corporate level.

This brings the second segment to a close.  I'd love to have your feedback: what else about the Data Vault are you interested in?  Again, these entries are meant to be high level - to explain the concepts.  Let me know if I'm meeting your needs, and feel free to contact me directly.

Thank you kindly,

Dan L,  DanL@DanLinstedt.com

Posted March 15, 2010 6:39 PM
Permalink | 1 Comment |

Most of you by now have heard the words: "Data Vault".  When you run it through your favorite search engine you get all kinds of different hits/definitions.  No surprise.  So what is it that I'm referring to when I discuss "Data Vault" with BI and EDW audiences?

This entry will try to answer such basic questions, just to provide a foundation of knowledge on which to build your fact-finding.

Data Vault: Definitions vary - from security devices, to appliances that scramble your data, to other services that offer to "lock it up" for you...  That's NOT what I'm discussing.

I define the Data Vault as having two basic components:

COMPONENT 1: The Data Vault Model

The modeling component is really (quite simply) a mostly normalized hub-and-spoke data model design, with table structures that allow flexibility, scalability, and auditability at its core.

COMPONENT 2: The Data Vault Methodology

I've written a lot less about this piece.  BUT: this piece is basically a project management component (project plan) + implementation standards + templates + data flow diagrams + statement-of-work objects + roles & responsibilities + dependencies + risk analysis + mitigation strategies + level-of-effort guesstimations + predicted/expected outcomes + project binder, etc.

What's so special about that?

Well - what's special about the methodology is that it combines the best practices of Six Sigma, TQM, SEI/CMMI Level 5 (people and process automation/optimization), and PMP best practices (project reviews, etc.).  Is it overkill?  For some projects, yes; for others, no.  It depends on how mature the culture of your organization is, and how far along the maturity path IT is - whether or not they are bound or decreed to create and then optimize the creation of enterprise data warehouses.

Ok - the project sounds a lot like "too huge to handle": old, cumbersome, too big, too massive an infrastructure, etc.  Yeah, I've heard it all before, and quite frankly I'm sick of it.

I built a project this way in the 1990s for Lockheed Martin Astronautics called the Manufacturing Information Delivery System (MIDS/MIDW for short), which, last I heard, is still standing, still providing value, and still growing today.  I was an employee of their EIS (Enterprise Information Systems) company.  My funding came from project levels, specifically through contracts.  I couldn't get time from a fellow IT worker without giving them my project charge number (yes, CHARGEBACKS).  So every minute we burned was monitored and optimized.  We built this enterprise data warehouse in 6 months total with a core team of 3 people (me, a DBA, and a SME).  We had a part-time data architect/data modeler helping us out.  We wrote all our code in COBOL, SQL, and Perl scripts.  Our DEC/Alpha mainframe was one of our web servers, so we wrote scripts that generated HTML every 5 minutes to let our users know when our reports were ready.

Ok - technology has come a long, long way since then, but the point is: we used this methodology successfully with limited time and limited resources.  We combined both waterfall and spiral project methodologies to produce a repeatable project for enterprise data warehouse build-out.  At the end of the project we were able to scale out our teams from our lessons learned, optimize our IT processes, and produce more successes in an agile time frame.  We had a 2-page business requirements document; from the time the business user filled it in and handed it back to us to the time we delivered a new star schema was approximately 45 minutes to 1 hour (as long as the data was already in the data warehouse and we didn't have to source a new system).

This is efficiency.  We had a backlog of work from around the company because we had quick turn-around.  Is this agile?  I don't know - all I know is that it was fast and the business users loved it.

Anyhow, I'm off track - so let's get back to it.

The methodology is what drove the team to success - it allowed us to learn from our mistakes, correct and optimize our IT business processes, manage risk, and apply the appropriate mitigation strategies.  We actually got to a point where we began turning a profit for our initial stakeholders (they were re-selling our efforts to other business units, bringing in multiple funding projects across the company because of our turn-around time).  The first project integrated 4 major systems: Finance, HR, Planning, and Manufacturing.  The second project integrated Re-work, Contracts, and a few others, like launch-pad parts.

Anyhow, at the heart of the methodology was - and is - a good (I like to think great) data architecture: the Data Vault Modeling components.

This is just the introduction, there is more to come - I really am counting on your feedback to drive the next set of blog entries, so please comment on what you'd like to hear about, what you have heard (good/bad/indifferent) about the Data Vault Model and / or methodology.  Or contact me directly with your questions - as always I'll try to answer them.


Dan Linstedt


Posted March 14, 2010 8:52 PM
Permalink | 3 Comments |

I've been working heads-down quite a bit lately on building new releases and, of course, on new research and design.  I apologize to all my faithful readers for the silence on my blog.  The good news is that Data Vault data modeling is taking off in the world, mostly due to compliance, governance, and auditability requirements faced by major industries.  You can follow this at http://www.DataVaultInstitute.com - free forums.

On another note this entry will explore some of the R&D notions that I'm currently developing.

Automorphic design...  Hmmm, lots and lots of different things come to mind.  Lately I've been experimenting with "self-adapting" data models, artificial intelligence, learning systems, the neural capacities of the brain (for learning new ideas, categorization theory), and so on...  Just a hodge-podge or mish-mash of activity centered around the capture and representation of conceptual thinking, along with the ability to "mine structures" to figure out more optimal models or alternative points of view.

If you aren't using the Data Vault Modeling today, or you haven't heard about it, then this entry might not make a lot of sense.  You can check it out or come to one of our certification courses at: http://www.DataVaultInstitute.com

Now let's break from the traditional discussions for a moment, and think about "memories" and how humans learn new ideas.  *** PLEASE NOTE: THIS IS A HOBBY OF MINE. I AM NOT OFFICIALLY CERTIFIED IN NEURAL TECHNOLOGIES, I AM NOT A NEUROSURGEON, NOR DO I STAKE CLAIM TO UNDERSTANDING HUMAN PSYCHOLOGY ***

These are just ideas, thoughts, and opinions.  With that, let's be on our way.

If I give you a date on a calendar, especially one that has significance in your life, you will start recalling memories around that date.  This is representative (I believe) of a HUB, or a business key.  It's also representative of a primary identifier of a base concept or point of view.  The memories you recall may end up being interpreted differently than those of a person who was with you on that date; this is the notion of contextual thinking.  You have assimilated facts about that day (maybe you remember a partly cloudy sky when your friend remembers it as sunny) that are now interpreted based on how you felt about that day.

Ok, Data Vault models have a similar construct - note: NOT a similar function (at least not yet).  The HUB is the "key," or the unique identifier, which allows us to construct or establish a base access point to a specific conceptual idea for a point in time.  The Satellites around the HUB establish the "contextual facts," or memories, across multiple time spans.  How we interpret these facts can be seen as an AI activity/application.

How we view these facts will all depend on our "point-of-view/reference" with which we are querying these facts.

They say that when we learn something new, we establish new dendrites/synapses that connect neurons together.  They also say that the more we think about something, the stronger the electrical impulses and the thicker these connections get.  They then say that at a certain point these memories (for the most part) become "permanent" as they are committed to long-term memory.  Finally, they say we never stop learning...  We also dream at night, and some argue that dreaming is a form of assimilation of short-term memories (categorization, organization, and attachment) into long-term memory.  They also suggest that this is a form of learning, but also of establishing context - your frame of reference.  Considering these statements, one could suppose that this contributes to character and the way you live your life.

Back to data modeling.  How is this implemented in a data warehouse?  And what REALLY is business intelligence?  Esoteric questions, to say the least!  Ok, we'll take a shot at it.

A data warehouse (built on the Data Vault) is like short-term and long-term memory: it commits FACTS (sights, sounds, colors, smells, tastes) to long-term memory, and it (hopefully) organizes them according to your business point of view (the line of business or industry you're in).  Unimportant facts should be tossed out.  Important and auditable facts should be committed to permanent (enterprise) memory.  They should be KEYED by business keys for quick reference and searchability.  The Hubs are then surrounded by "fact data" in Satellites to determine a CONCEPT.  These CONCEPTS are then LINKED together by associations, and the associations have concepts as well that describe these notions.  The resulting Data Vault model acts as a giant ontology organizer; the POINT OF VIEW can be pivoted based on the HUB that someone is interested in.

Well, this is all very well and nice, but how does this relate to auto-changing data models?

The hope, or plan, is this: AI algorithms "pair up" ontologies and optimize references.  In other words, we build a METADATA (or structural) mining algorithm which uses the data mining results to "stand up" or "tear down" the assumptions about the structural linkages.  At the end of the day, this "algorithm" watches queries and load patterns, and secondarily applies DATA patterns (for strength and confidence ratings) to establish new linkages, tear down old linkages, and evolve the model.

The Data Vault Model is unique in this way: "dynamic changes" to structure can be applied without too much hassle, and in fact the model is proving very effective in this area.  The thought process is: with occasional manual correction to the AI learning algorithm, the model will self-adapt and evolve as the business changes, needs change, and points of view or reference change.

I've already worked with one business where I applied my knowledge of the model to create a Link table that previously didn't exist, and the business gained a 40% revenue increase through access to additional insights they never had before.

Ok, enough babble for now... what does this mean to my EDW?

Nothing really, at least not today.  But it does mean that you should be looking at the Data Vault Modeling components in order to achieve these dynamic benefits in the future.

As always, I'm open to your thoughts, comments and opinions - did I end up in left-field here or do you see this as important?


Dan Linstedt  danl@danlinstedt.com

Posted May 5, 2009 8:22 AM
Permalink | No Comments |

I've blogged about this topic for many years now; my first mention of it was in my www.TDAN.com articles regarding the Data Vault modeling architecture. That said, I've been blogging on everything from autonomic data models to dynamic data warehousing, but in my research I've come to realize I've left out some very critical components. I've lately been experimenting with building a self-adapting structured data warehouse. There are many moving pieces and not all the experiments are finished, so I cannot write (yet) about any of the findings. But here I'll expose some more of the underbelly, as it were, that is necessary to make DDW a reality (in my labs, anyhow)....

I've tried and tried to find a new name for this thing, but alas, it seems to elude me. Dynamic Data Warehousing has a nice ring and is quite the nice fit. The term, however, evokes all kinds of different meanings for different companies and different people - so much so that I've had open discussions with IBM in the past about their use of the term! Oh well, water under the bridge.

But that brings me to my next point. There are missing components in my definition of DDW; I didn't get it all, and I'm sure this is just another step in the definition (the definition will not be completed for another year or two). If I look back at what's going on, I see the following:

Convergence of:
* Operational Processing and Data Warehousing.
* Master Data and the Metadata needed to use the Master Data properly
* Tactical decisions backed by strategic result sets
* Business, Technical, Architectural, and Process Metadata
* Real-Time and Batch processing
* Standard reporting technologies and "live animated scenarios" with walk-throughs and 3D imagery
* Human-machine interfaces
* MPP RDBMS systems and Column Based Database solutions

Why then shouldn't we see convergence of "data models" and "business processes"?
or "Data Models" and "Systems Architecture"?

The point is: WE ARE (or at least I am). Not only is this happening in my labs, but it's being requested of me when I visit client sites. The customers want "one solution" - or better yet, they want a solution that "appears to learn" based on the demands put upon the system.

Why do I say "appears to learn"?
Because machine learning and the appearance of machines translating context are two totally different things. I cannot and will not claim to have made a machine think. However, I can and have made a machine's enterprise data warehouse responsive to external stimulus - at least when it comes to the data model, loading routines, and queries. Please do NOT mistake this for anything more than AI applied in a new manner - mining metadata (structure, queries, load-code, and web services) rather than just mining the data sets themselves. (More on that later, much later - I still have a LOT of research to do.)

Ok - so what's missing from the Dynamic Data Warehouse definition?
* Use of metadata: business, technical, and process during the model learning/adaptation phase
* Use of an ontology (part of business and technical metadata as described above)
* Use of a training model, all good neural nets need to be trained over time, and then corrected.
* Use of the queries to examine and compare HOW the data sets are being used and accessed against the current data model
* Use of a minimal load-code parser, again to assist in training the neural net to recognize the correct structure.

Anyhow, you get the point. Dynamic Data Warehousing is about a back-office system that responds to changes in the structured data world - as the queries change, the indexes change. As the incoming data set changes, the model needs to change. Some queries (if consistent enough) can actually express new relationships that need to be built.

This is an adaptable system, this is a dynamic system, this will eventually become a true Dynamic Data Warehouse.

Dan Linstedt

Posted September 21, 2008 9:52 PM
Permalink | 3 Comments |

In my last entry in this category, I described automorphic data models and how the Data Vault modeling component is one of the architectures/data models that will support dynamic adaptation of structure. In this entry I will discuss a little bit about the research I'm currently involved in, and how I am working towards a prototype to make this technology work.

If you're not interested in the Data Vault model, or you don't care about "Dynamic Data Warehousing" then this entry is not for you.

The Data Vault model reaches the height of flexibility by applying the Link tables. It is an architecture that is linearly scalable and is based on the same mathematics that MPP is based on. Single Link tables represent associations - concepts linking two or more KEY ideas together at a point within the model. They also represent the GRAIN of those concepts.

Because the Link tables are always many-to-many, they are abstracted away from the traditional relationships (1-to-many, 1-to-1, and many-to-1). The Links become flexible and, in fact, dynamic. By adding strength and confidence ratings to the Link tables, we can begin to gauge the STRENGTH of the relationship over time.
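One way to read "strength and confidence ratings" is via the support and confidence measures from association-rule mining; the sketch below shows the arithmetic with made-up counts and a hypothetical `rate_link` helper (the 80% bar is an assumption for illustration):

```python
def rate_link(co_occurrences, left_count, total):
    """Score a candidate Link between two business keys.

    strength   -- how often the pair appears across all observations (support)
    confidence -- given the left key, how often the right key accompanies it
    """
    strength = co_occurrences / total
    confidence = co_occurrences / left_count
    return strength, confidence

# Hypothetical counts: the pair appeared together 80 times, the left key
# appeared 100 times, out of 1,000 total observations.
strength, confidence = rate_link(co_occurrences=80, left_count=100, total=1000)
# strength = 0.08, confidence = 0.8 -- above an assumed 80% confidence bar,
# so this relationship would be materialized as a rated Link row.
```

Storing these two numbers on the Link row (rather than deleting weak Links outright) is what lets the relationship's strength be tracked over time.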

Dynamic mutability of data models is coming. In fact, I'd say it's already here. I'm working in my labs to make it happen, and believe me, it's exciting (only a geek would understand that one...). The ability to:

* Alter the model based on incoming where clauses in queries (we can LEARN from what people are ASKING of the data sets and how they are joining items together)
* Alter the model based on incoming transactions in real-time (by examining the METADATA) and relative associativity / proximity to other data elements within the transaction.
* Alter the model based on patterns DISCOVERED within the data set itself - patterns of data which were previously unconnected or not associated.
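The first bullet - learning from the WHERE clauses users write - can be illustrated with a naive predicate harvester.  The regex and the `join_candidates` helper are purely illustrative (a real parser would handle aliases, quoting, and ANSI JOIN syntax):

```python
import re

# Match predicates of the form  table1.col = table2.col
JOIN_PATTERN = re.compile(r"(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)")

def join_candidates(where_clause):
    """Return the table pairs equated in a WHERE clause - i.e., the
    relationships users are implicitly asserting when they query."""
    pairs = set()
    for t1, _, t2, _ in JOIN_PATTERN.findall(where_clause):
        if t1 != t2:
            pairs.add(tuple(sorted((t1, t2))))
    return pairs

clause = "customer.id = orders.customer_id AND orders.sku = product.sku"
print(join_candidates(clause))   # {('customer', 'orders'), ('orders', 'product')}
```

Fed into a counter like the one sketched earlier in this category, recurring pairs become candidates for new Link tables.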

The dynamic adaptability of the Data Vault modeling concepts shows up as a result of these discovery processes. I'm NOT saying that we can make machines "think," but I AM suggesting that we can "teach" machines HOW information is interconnected through auto-discovery processes over time. This mutability of the structure (without losing history) begins to create a "long-term memory store" of the notions and concepts we've applied to the data over time.

By recording a history of our ACTIONS (what data we load, and how we query it), we can GUIDE the neural network toward better decision making and management of the structures underneath. This ranges from optimization of the model to discovery of new relationships that we may not have considered in the past.

The mining tool is:
* Mining the data set AND
* Mining the queries AND
* Mining the incoming transactions

to make this happen. We've known for a very long time that mining the data can reap benefits, but what we are starting to realize NOW is that mining these other components drives home new benefits we've not considered before. In the Data Vault book (The New Business Supermodel) I show a diagram of convergence (which Bill Inmon has signed off on). Convergence of systems is happening; Dynamic Data Warehousing is happening.

These neural networks work together to achieve a goal: creating and destroying Link tables over time (dynamic mutability of the data model) while leaving the KEYS (Hubs) and the history of the keys (Satellites) intact. Keep in mind that the Satellites surrounding Hubs and Links provide CONTEXT for the keys.

I've already prototyped this experiment at a customer site, where I personally spent time mining the data, the relationships, and the business questions they wanted to ask. I built one new Link table as a result, with a relationship they didn't have before. We used a data mining process to populate the table where strength and confidence were over 80%. The result? The business increased its gross profit by 40%. They opened up a new market of prospects and sales that they didn't previously have visibility into.

Again, I'm building new neural nets and new algorithms using traditional off-the-shelf software and existing technology. It can be done; we can "teach" systems at a base level how to interact with us. They still won't think for themselves, but if they can discover relationships that might be important to us and then alert us to the interesting ones, then we've got a pretty powerful sub-system for back offices.

More on the mathematics behind the Data Vault is on its way. I'll be publishing a white paper on the mathematics behind the Data Vault Methodology and Data Vault Modeling on B-Eye-Network.com very shortly.

Dan Linstedt

Posted August 27, 2008 5:54 AM
Permalink | 5 Comments |
