
Blog: Dan E. Linstedt

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, Unstructured Data, DW and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University, where I participate on an academic advisory board for Masters students in I.T. I can't wait to hear from you in the comments of my blog entries. Thank you, and all the best, Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor for The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI / CMMi Level 5, and is the inventor of the Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, has trained hundreds of industry professionals, and is the voice of Bill Inmon's blog at http://www.b-eye-network.com/blogs/linstedt/.

This was a hot topic for most of you: with compliance breathing down our necks and the government hot on the audit trail, we have to do something. And something we shall do! In fact, the nature and notion of EDW and ODS is changing, as I blogged in my most recent entry in this category. I made a statement:

"Flip the coin, and store RAW data as-it-stood on the source system, but in an integrated fashion in your data warehouse; now what have you got? A solid architecture (if modeled properly) which allows data to be auditable from that time period before the change. The Data Warehouse has now become a system-of-record."

A comment was made that this sounded like an oxymoron, and I was asked to elaborate. In this entry I'll attempt to explain what I mean by this statement. It's very possible that I didn't state it quite "correctly"....

Ok, here are the facts, just the facts... I believe our data warehouses must return to storing data "as it stood" in the source system - that is, snapshot copies of the good, the bad, and the ugly - all in the warehouse all at the same time. But this brings with it one major problem: the Data Warehouse still must provide some layer of integration horizontally across the enterprise.

What I mean is:
The enterprise may have more than one copy of "CUSTOMER" records, and if CUSTOMER is defined at a single semantic layer and has the same semantic definition across systems - then we must integrate the CUSTOMER in the warehouse. How does this work? Well, I have a data modeling technique called the Data Vault. The Data Vault is a freely available, public domain data modeling architecture for your enterprise data warehouse. I've been teaching the Data Vault architecture at TDWI over the past couple of years, and I have a site dedicated to its nature and to discussions around its implementation. You can check it out at http://www.DanLinstedt.com (for free).

The articles I've written walk through integration of source systems surrounding raw data for compliance reasons. One of the KEY notions is that in this example, the business uses CUSTOMER KEY to access CUSTOMER records - it doesn't matter which system they are accessing, they need some form of KEY to get the data out. No key? You can't find the data... it is lost forever in the source system.

Let's assume the semantic definition for Customer states that all source systems capture customer at an INDIVIDUAL level, not a CORPORATION level (or, if they do capture CORPORATION, they hopefully assign different keys and place the CORPORATION in a different source table). Ok, we've established INDIVIDUAL as the semantic layer, and CUSTOMER KEY as the horizontal integration point.

Based on this notion, we must design the warehouse around the BUSINESS KEY known as CUSTOMER_KEY, and thus INTEGRATE the information horizontally into a single table called HUB_CUSTOMER. In loading HUB_CUSTOMER we size the key column to hold the largest source data type, and we record the load date and record source (which source system the KEY came in from). At the end of the day we have an integrated list that provides the business with a single FULL source of the customer keys that exist across ALL our source systems.
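To make the hub-loading idea concrete, here is a minimal Python sketch. The function and field names (load_hub_customer, customer_key) are illustrative assumptions, not part of any specific tool; a real hub load would run in your ETL/ELT platform against database tables.

```python
# Hypothetical sketch: loading HUB_CUSTOMER by integrating business keys
# from multiple source systems. Names and structures are illustrative.
from datetime import datetime, timezone

def load_hub_customer(hub, source_rows, record_source):
    """Insert any customer keys not already present in the hub.

    hub          -- dict mapping customer_key -> (load_date, record_source)
    source_rows  -- iterable of raw source records with a 'customer_key' field
    record_source -- name of the source system the rows came from
    """
    for row in source_rows:
        # Normalize every key to the widest common type (here, a string)
        key = str(row["customer_key"]).strip()
        if key not in hub:
            # First system to present the key wins; record when and where
            hub[key] = (datetime.now(timezone.utc), record_source)
    return hub

# One integrated key list across all source systems:
hub = {}
load_hub_customer(hub, [{"customer_key": 1001}], "BILLING")
load_hub_customer(hub, [{"customer_key": "1001"}, {"customer_key": "CRM-7"}], "CRM")
# hub now holds exactly two keys: "1001" (first seen in BILLING) and "CRM-7"
```

The point of the sketch is the insert-if-absent behavior: the hub stores each business key once, regardless of how many systems present it, along with the load date and record source for auditability.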

Let me back up a minute and define what I mean by INTEGRATION:
Integration (in the Data Vault) means to place any data set that is at the SAME semantic layer, identified by the same semantic business key, into a single structure or set of structures in the Data Warehouse, such that horizontal integration of systems has occurred. My definition of integration does not necessarily include transformation or alteration of the data set under the covers, hence the allowance for compliance-based data sets.

One could argue that we are changing the byte representation for some data (changing an integer representation to character, or Unicode, to fit in the warehouse CUSTOMER_KEY column), but for all intents and purposes the data is still traceable, and the value of the data is preserved.
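A tiny Python illustration of that point (the variable names are mine, purely for illustration): widening an integer key into a character column changes the bytes but not the value, so the key round-trips and stays traceable back to the source.

```python
# Illustrative only: an integer source key widened to a character column.
raw_key = 1001                  # integer key as it stood in the source system
warehouse_key = str(raw_key)    # stored as character data in HUB_CUSTOMER

# The byte representation changed, but the value is preserved and traceable:
recovered = int(warehouse_key)
```
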

So to recap the quote at the top:
"Flip the coin, and store RAW data as-it-stood on the source system, but in an integrated fashion in your data warehouse; now what have you got? A solid architecture (if modeled properly) which allows data to be auditable from that time period before the change. The Data Warehouse has now become a system-of-record."

I mean: copy the data, without transforming it, into the data warehouse. Place the same data at the same grain into the same structure (regardless of source system). In other words, customer keys for individuals are placed into HUB_CUSTOMER, and customer data for individuals is placed in a satellite structure, dependent on the key - SAT_CUSTOMER - and the satellite contains snapshots of the data over time, which establish a CRC/audit trail for information change.
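The satellite's change-tracking can be sketched in Python as well. This is an assumed, simplified implementation: the function name, the dict-based storage, and the use of an MD5 checksum as the "CRC" are all illustrative choices, but the logic - write a new time-stamped row only when the descriptive attributes actually change - is the audit-trail behavior described above.

```python
# Hypothetical sketch: loading SAT_CUSTOMER with a checksum-based audit trail.
import hashlib
from datetime import datetime, timezone

def load_sat_customer(satellite, customer_key, attributes):
    """Append a time-stamped snapshot only if the attributes changed.

    satellite  -- dict mapping customer_key -> list of (load_date, crc, attrs)
    attributes -- dict of descriptive fields (name, address, ...)
    """
    # A checksum over a canonical rendering of the attributes detects change
    canonical = "|".join(f"{k}={attributes[k]}" for k in sorted(attributes))
    crc = hashlib.md5(canonical.encode("utf-8")).hexdigest()

    history = satellite.setdefault(customer_key, [])
    if not history or history[-1][1] != crc:
        # First load, or the data changed: add a new snapshot to the trail
        history.append((datetime.now(timezone.utc), crc, dict(attributes)))
    return satellite

sat = {}
load_sat_customer(sat, "1001", {"name": "Jane Doe", "city": "Denver"})
load_sat_customer(sat, "1001", {"name": "Jane Doe", "city": "Denver"})   # no change
load_sat_customer(sat, "1001", {"name": "Jane Doe", "city": "Boulder"})  # change
# sat["1001"] holds two snapshots: the original row and the changed row
```

Because unchanged loads produce no new rows, the satellite's history is exactly the sequence of states the data passed through - which is what makes the warehouse auditable "as it stood" at any point in time.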

I hope I've cleared this up. I have a book on the Data Vault in the works which will be available Q2-2006 on B-Eye Network.

I welcome all thoughts, questions, and concerns.
Dan L

Posted January 6, 2006 3:58 AM