Business Intelligence Network Business Intelligence Resource

Blog: Dan E. Linstedt

« Hidden in the un-structured information... | Main | DNA Computing - Control over DNA Molecules »

Is it time to re-define your Data Warehouse?

I've commented in the past on my definition of the data warehouse, and recently, based on that definition I've been commenting on Master Data Management. In this blog I take a step back, and post the pro's and con's of constructing a compliant (active) data warehouse. I would love to have everyone weigh in, and tell us what kind of a data warehouse your organization is implementing and why. I'd like to clear the air and see if compliance within a data warehouse is really an issue for the enterprise.

What exactly does a COMPLIANT data warehouse mean to you? Please tell us, we'd love to hear about it.

I've grown up in the industry believing that constructing auditable historical data stores is the proper way to build "data warehouses." I've had huge successes in passing audits, proving the warehouse contains correct data according to the source, and producing data marts of all shapes and sizes. In the environment we were in, with this approach, we've shown time and time again: the flaws in the operational systems (including operational reports) which were costing the company millions of dollars a year. Without auditable historical data stores (what I call a compliant data warehouse), the nay-sayers would've been right when the blamed the warehouse for being "wrong" and our team would have been put "out of business."

This approach to defining data warehouses and the process of data warehousing has lead me to new architectures (like the Data Vault data model), new methods of loading data and validating utilizing ETL/ELT routines, and writing articles on compliance and the nature of the data loads. However, I understand from a number of sources that Not all data warehouses need to be compliant - but is this really true? I'd like to hear from those who don't need the warehouse to be compliant nor auditable within their organization. I'd like to know exactly what the enterprise is using the warehouse for, and how they justify the data within.

With that, let's take a look at the pros' and cons' (from my opinionated stance) of compliant versus non-compliant warehouses:

Compliant:
Pros:
* Provides accurate data, data that matches (good / bad or indifferent) the source systems
* Provides integrated (at the semantic business key level) data sets, with non-altered details, and the same grain as the source system. Again, the data is hooked together through defined relationships across business keys (see the Data Vault modeling concepts))
* By bringing both the good and the bad data into the warehouse, can show where business processes are truly broken, can often show when they broke (as long as history is available to demonstrate this)
* Allows extremely rapid build-out of any data mart desired for the organization. Once a standard data model for compliant / historical data store has been established, data mart build out can be done quickly. We had a process that allowed architecture and design (loaded with a percentage sample data set) within an hour of the request.
* Begins to shift the vision from "warehouse" to SoR (system of record)
* Places accountability into the hands of the business users with the use of "error marts"
* Increases visibility into broken source systems, broken business processes.
* Increases security of the data set in the warehouse
* Increases metadata and provides additional data lineage discovery points.
* Business value in the grain of the data supports data mining activities, along with value to produce "what's broken" and "what's working" across business functional units.
* Master Lists are clearly produced as a "mart" or a delivery mechanism directly from the compliant warehouse.
* Compliant warehouses offer an easier path to "near-real-time" and/or "active" data warehousing, because the complex business rules are applied downstream, from the warehouse TO the mart.

Cons:
* Introduces dirty data into the warehouse.
* Begins to shift the vision from "warehouse" to SoR (system of record)
* Moves the "logic" of cleansing, grain shift, and master data production down-stream.
* Requires data marts for delivery of "cleansed/merged/mixed" data according to the business.
* Raises questions about the warehouse being / acting similar to an operational system.
* Requires (at a minimum) an added layer of data storage before end-user utilization of the data set.
* Requires additional funding for development effort (nothing above what you wouldn't do for a normalized warehouse).

Now, let's take a look at a traditionally defined data warehouse, or a non-compliant data warehouse.
Pros:
* Single storage area for all "data warehousing activity"
* Data in the warehouse is cleansed, altered, and heavily integrated (loss of grain in many cases) - producing what we like to call Master Data sets.
* Inherant business value built in to the transformation and integration layer on the way in to the warehouse.
* Doesn't necessarily require separate marts to deliver the data to the business user.
* No question that the System of Record is one or more source (operational) system.
* Master "lists" of cleansed and "fixed" or altered data become a source of business revenue, and often drive operational systems.

Cons:
* When the business changes their definition of "master data", all the transformation layers must change on the way in to the warehouse, usually resulting in data model changes, and huge impacts to accommodate the change.
* Changes often cause heartburn in IT and business - because the grain of the previously rolled up data may shift, therefore interpretation of old historical data may change on the way out. This can lead to confusing financial figures, and numerous questions about the "warehouse being right".
* It's easy for nay-sayers to prove the warehouse "is wrong", because it doesn't follow their interpretation of the business rules.
* Audits are difficult (if not impossible) to pass, WITHOUT extra production of Audit Trails along the way of the ETL / ELT routines, in other words, the data before, after, and when - of transformation - must be recorded in order to pass an audit.
* Master "lists" of cleansed and "fixed" or altered data become a source of business revenue, and often drive operational systems. (This is both a pro and a con - depending on your viewpoint).
* "real-time" or "Active" warehouses are more difficult and more costly to produce, all the data arriving must go through complex data integration rules and cleansing before landing in the warehouse. Often times the processing of these rules rely on alternate data sets which may not be available within a 1 minute or less refresh cycle.

These are just my thoughts, I'd love to hear what you would add to the pros and cons of each of these lists - I want to know what you are experiencing in the market place. Many of the warehouses built with compliance in mind (as I've described it above), have had 10+ years of success and are in fact growing today, with buy-in from finance, HR, sales, and even the corporate board of directors.

Please let me know what you think, I'm also curious to know how many of you are seeing a request for a compliant data warehouse - and just what does that mean?

Hope to hear from you soon,
Dan Linstedt

  Posted by Dan Linstedt on March 7, 2006 6:25 AM |

Comments

As you know, we have been following your approach of building our ODS and data vault to be the "source of facts". We recently had the chance to roll some of that data into an advanced analytical application that did some calculations and various rollups. When the users saw they data they were shocked to see some of the values as they appeared to be wrong. We were easily able to trace it back to the source system to validate that it was indeed what had been stored there. This in turn identified a lack of consistent business rules and processes regarding the entry of the data. Now there is a meeting with the "powers that be" to discuss how to fix this data and the processes surrounding it so that the analytic application (which happens to involve budget forecasting) can get the right data consistently.

Post a comment