In my recent article (Different Flavors of Design for Business Intelligence Databases), I discussed normalized data designs. In
this article, I discuss a technique known as the data vault. The data vault (see www.datavaultinstitute.com) was invented by Dan
Linstedt, and he offers certification courses for the data vault via Genesee Academy.
Data Vault
The data vault designs on the premise that there is only one version of fact (as opposed to truth). The fact is what is brought into the data vault from a source system – and should be
traceable and auditable. I should be able to trace every piece of data in the data vault back to the source that it came from and (assuming the data is still there and unchanged!) find it.
This, of course, brings to mind an important question: What business rules are being applied as the data enters the data vault? The answer: None. Business rules are applied as the data moves from
the data vault to the data mart. Hence, the data vault is the physical structure of the data warehouse. Personally, and as you will see shortly, I would not let my business community
“query” the data vault. But then again, in only very rare circumstances, do I let my business community “query” the data warehouse. I generally am along the lines of letting
my business community “query” only the data marts.
The data vault technique is built around the following constructs:
Hubs
Hubs are lists of unique business keys that define a business element (for example, Customer, Order, Product, etc.). Hubs contain:
- A surrogate key, as the primary key
- One or more columns that make up the business key
- Load date time stamp (When did it get here?)
- Record source (Where did it come from?)
That’s it. You might include some other information that is useful to the running or maintenance of the data vault, but no more actual data. Each business key must be unique and should
represent that which the business believes is what makes the business element unique. Source systems may have a Products table that has an ID/SEQUENCE/KEY that is the primary key, but is not what the
business thinks makes the product unique. For example, you may have a Product_ID and a Product_Code.
The Product_ID is defined as a sequence number and Product_Code may be an actual business key. In this case, our hub would look like:
In all cases, there should be a unique index defined on the business key (Product_Code). The Load_Date is the date time stamp that this unique Product_Code was loaded into the data vault.
Record_Source would be the name of the source system that the data came from.
Satellites
Satellites are the descriptive items of a hub or a link. We see from our Products example that there is some descriptive information that is also stored in the source table. These columns, most
likely, go into satellites. Therefore, the satellite may look like:
An interesting point about satellites is that you also might want them to be separated based on the amount of changes to specific columns. For example, let us say that Category and SubCategory
changes frequently, compared to the rest of the columns. Therefore, you may wish to break the satellite into two satellites like:
Satellites are where your changing-over-time data is located – the historic perspective. Therefore, we need to add some other columns to the tables. Specifically:
- Load date time stamp
- Load end date time stamp
- Record source
You may wish to add other columns for the running and maintenance of your data vault, but that is up to you. Now our satellites might look like:
From this, we can build any of our Type 1, 2, or 3 slowly changing dimensions that we need in the data mart. Something that needs to be kept in mind is that you always want to have a Sat_Product and
Sat_Product_Categorization row for each ProductKey in the hub. The reason is because it would be better to do inner joins than remember to do outer joins. Without the row in the satellites, you would
always have to do outer joins.
Links
Links tie together two or more hubs. Links represent the relationship between the two or more business elements. These are usually business events or transactions. Each link has:
- Its own surrogate key
- The keys from the hubs
- The load date time stamp
- Record source
Let us say that our Product (from above) is sold to a Customer via a cash register receipt. The link would look something like:
So, what does this tell us? It tells us that a Customer bought a Product and the data vault received it on a specific load date – in essence, not too much. How many did the customer buy? What
is the value? Those items would be stored in a link satellite, especially since these things can change over time. So now, we might have:
You probably will have more data (link an Invoice that would have the date of the invoice, etc.), but that just helps you build out your data vault. In that case, you might define a three-way link to
deal with that. For example:
The key points to remember are that we should be able to reconstruct the data to the source and we should not have any data that is not found in the source system.
There are some interesting tidbits that I would like to point out. The first one is “Yes,” there are a lot of tables. Personally, that does not bother me, but it could frighten a novice
DBA or businessperson. The good thing is that the flexibility of the technique and model allows us to adapt quickly to changes in the business while giving us traceability and auditability.
Second, the
ETL code is quite simplified and will run extremely well. For example, you would write code that loads hubs, links and then satellites (both hub and link satellites). Each of the hubs can
run in parallel. Each of the links can run in parallel. Each of the satellites can run in parallel. Also, depending on your hardware, you might also be able to run hub satellites in parallel with
links and then run link satellites. Therefore, the amount of parallelization is quite good.
This article provides a
very brief introduction to the data vault. There are more things that are defined in the data vault that can
help you out. If you would like to get more information, please check into the data vault website (listed previously). In order to make all things known, I was recently certified in the data
vault.
-
Chuck Kelley
Chuck is an internationally known expert in database technology. He has more than 30 years of experience in the design and implementation of operational/production systems, operational data stores
and data warehouses (data marts). Chuck teaches seminars on a variety of database and data warehousing topics. He has co-authored or contributed to four books on data warehousing and has been
published in numerous trade magazines and written columns on database technology, data warehousing, metadata, master data management, data governance and enterprise data strategies. He may be
contacted at chuckkelley@usa.net.
Editor's Note: More articles, news and resources are available in Chuck Kelley's BeyeNETWORK
Expert Channel on Database Design Techniques. Be sure to visit today!
Recent articles by Chuck Kelley
Comments
Want to post a comment? Login or become a member today!
Posted January 25, 2010 by Chuck Kelley
Leon,
Thanks for the comment. I don't consider this anything like Kimball's staging area, but I suppose a case could be made. I really see it as an Enterprise Data Warehouse.
As for the cubes or the presentation layer, you are absolutely correct. I belive that you need to create data marts and/or cubes for the business community to use. If you choose to create views to allow the creation, that would be one solution. I would write ELT/ETL to do produce them. You are correct ... I would NEVER let my business community query the data vault -- Too complex for them to comprehend and too easy to get interesting results with a query tool (at least in my opinion).
Is this comment inappropriate? Click here to flag this comment.
Posted January 24, 2010 by León Carpay
Hi Chuck, enlightning article. Never knew exactly what Data Vault was. What wonders me though, the Data Vault that you describe looks like what I (Kimball-school) call the stagingarea. I think, using the Data Vault technique, we need also something to build the cubes on. Is there a concept of a presentationarea or do the Data Vaulties just make a view on the hubs-links-satelites to built the cubes on???
Is this comment inappropriate? Click here to flag this comment.