Blog: Dan E. Linstedt« Who's on First - Abbot and Costello (parody) | Main | Giant Jellyfish Rip Nets in Japan » Redfining the EDW and ODSThis was a hot topic for most of you, with compliance breathing down our necks and the government hot on the auditing trail we have to do something. And something we shall do! In fact, the nature and notion of EDW and ODS is changing, as I blogged in my most recent entry in this category. I made a statement: "Flip the coin, and store RAW data as-it-stood on the source system, but in an integrated fashion in your data warehouse; now what have you got? A solid architecture (if modeled properly) which allows data to be auditable from that time period before the change. The Data Warehouse has now become a system-of-record." and a comment was made, that this sounded like an oxymoron - I was asked to elaborate. In this entry I'll attempt to explain what I mean by this statement. It's very possible that I didn't state it quite "correctly".... Ok, here are the facts, just the facts... I believe our data warehouses must return to storing data "as it stood" in the source system - that is, snapshot copies of the good, the bad, and the ugly - all in the warehouse all at the same time. But this brings with it one major problem: the Data Warehouse still must provide some layer of integration horizontally across the enterprise. What I mean is: The articles I've written walk through integration of source systems surrounding raw data for compliance reasons. One of the KEY notions is that in this example, the business uses CUSTOMER KEY to access CUSTOMER records - it doesn't matter which system they are accesing, they need some form of KEY to get the data out. No key? can't find the data... it is lost forever in the source system. Let's assume the semantic definition for Customer states that All source systems capture customer at an INDIVIDUAL level, not a CORPORATION level (or if they do capture CORPORATION) they hopefully assign different keys, and place the CORPORATION in a different source table. Ok, we've established INDIVIDUAL as the semantic layer, and CUSTOMER KEY as the horizontal integration point. Based on this notion: we must design the warehouse around the BUSINESS KEY known as CUSTOMER_KEY, and thus INTEGRATE the information horizontally into a single table called HUB_CUSTOMER. In loading HUB_CUSTOMER we use the maximum space for the largest data type, and record the load date and record source (which source system the KEY came in from), but we have an integrated list at the end of the day which provides the business with a single FULL source of customer keys that exist across ALL our source systems. Let me back up a minute and define what I mean by INTEGRATION: One could argue that we are changing the byte representation for some data (changing integer representation to character or Unicode to fit in the warehouse CUSTOMER_KEY column) but for all intensive purposes the data is still traceable, and the value of the data is preserved. So to recap the quote at the top: I mean: copy the data without transforming it, into the data warehouse. Place the same data at the same grain into the same structure (regardless of source system). In other words, customer keys for individual are placed into HUB_CUSTOMER, and customer data for individuals is placed in a satellite structure, dependant on the key - SAT_CUSTOMER - and the satellite contains data over time snapshots which establish a CRC/audit trail for information change. I hope I've cleared this up. I have a book on the Data Vault in the works which will be available Q2-2006 on B-Eye Network. I welcome all thoughts, questions, and concerns. |