For too long the industry has preached: load only quality data into your warehouse. Cleanse the data, manipulate it, and then load it. The mantra has been "never release bad data to the end users." There are hundreds of articles written about this, and probably quite a few groups carrying the mantra forward. But I have to ask: where did this mantra come from? How on earth did it get "written in stone"? What, really, is the true value here? When we implement this kind of paradigm, do we RISK getting the warehouse "wrong"? Are we integrating away problems that are causing the business to hemorrhage money?
I would argue that this paradigm is partially wrong, and that a data warehouse should be a statement of integrated fact, NOT a "single version of the truth." Why? Because there are diamonds in the rough, my friend. What I mean is: there is gold in exposing to the business the GAPS between what the source systems say the data IS and what the business BELIEVES it is (in how it operates and collects information).
These gaps lead to lost money, sometimes significant lost money, and to inefficient business processes. Unless we (I.T.) can expose the gaps between "what-is" and "what-they-truly-think," we cannot begin to help. We must learn to expose the good, bad, and ugly information in the system, bounce it against the business processes, and see where the business is broken. This is one of the true jobs of a good data warehouse. Armed with this information, the business can begin to investigate why it's broken, or whether it's even important enough to fix. They can begin taking ownership of, and accountability for, their own "bad data" and the "bad systems processes" that collect it.
In one case, upon exposing "bad data," the business was able to find and fix a mis-billing calculation that had been hidden in an operational financial billing report for over 15 years. The business had always "rounded away" the error because they couldn't find it. Now, if we (as a data warehousing / BI team) had "fixed" or "cleansed" the data going into the warehouse, we 1) never would have passed the financial audit when the warehouse was blamed for being wrong, and 2) never would have been able to expose the error to the business in the first place. Which means the business would have demanded that the same "rounding error" be coded into the data warehousing report.
This is just flat wrong. Now, am I saying there isn't value in cleansing and quality-checking the data? No, quite the contrary. I'm saying that in integrating the data going into the EDW, it should be RAW grain, with no changes to the data set, so that the errors and the patterns of errors can be seen. I'm suggesting that post-warehouse, rather than pre-warehouse, is the better place for quality and cleansing; in other words: put a filtered lens on the warehouse data on the _way_ to the data marts, not on the way into the EDW. Then construct something called an ERROR MART, where "bad data" (rows without keys, rows that break "today's" business rules, etc.) can be funneled.
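To make that concrete, here's a minimal sketch of the routing I'm describing; the table contents and rule names are hypothetical, and "today's rules" would of course be far richer in practice:

```python
# A sketch of quality filtering applied on the way OUT of the EDW:
# raw rows were loaded untouched; on the way to a data mart, rows that
# break "today's" business rules are funneled into an error mart instead.

raw_edw_rows = [
    {"order_id": 1001, "customer_key": "C-17", "amount": 250.00},
    {"order_id": 1002, "customer_key": None,   "amount": 99.50},   # missing key
    {"order_id": 1003, "customer_key": "C-42", "amount": -12.00},  # breaks a rule
]

def passes_todays_rules(row):
    """Today's version of the truth lives here, NOT in the EDW load."""
    return row["customer_key"] is not None and row["amount"] >= 0

data_mart, error_mart = [], []
for row in raw_edw_rows:
    (data_mart if passes_todays_rules(row) else error_mart).append(row)

print(f"mart rows: {len(data_mart)}, error mart rows: {len(error_mart)}")
```

The key point of the sketch: when the business rules change tomorrow, only this downstream filter changes; the raw EDW rows are untouched and remain auditable.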
This means a fundamental shift in the way we look at data warehousing as a practice. I've been building EDWs this way for 12+ years, with tremendous success; many (if not all) of these EDWs are still around today, growing, and passing compliance audits. What I'm saying is that the paradigm needs to change, and it just so happens that DW2.0 offers a rare opportunity to execute the change going forward, without a major impact to what's already in place. I'm also suggesting that the new paradigm be adopted going forward whether or not you're engaged in a DW2.0 build-out.
Remove the complexities of executing business logic "up-stream" (on the way IN to the warehouse); move these business computations and complex calculations "down-stream," to the way OUT of the data warehouse, on the way to true data marts. This is the proper way to build and scale an enterprise data warehousing vision. Now, those of you who have built a "staging area with history at a raw level of grain" have the basics of this component; in fact, it is an EDW whether you like it or not, even though it flows from "staging" into "federated data marts." However, if you find yourself changing, re-rolling, or re-stating the historical data every time the business has a fundamental change, then you have compromised compliance and auditability, and haven't executed the paradigm shift I'm discussing here.
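For those who want to see what "never re-stated" looks like mechanically, here's a minimal sketch of a raw-grain, insert-only load; the schema and field names are illustrative, not prescriptive:

```python
# A sketch of a raw-grain, insert-only EDW load: every source row is appended
# with a load timestamp and record source. History is never updated or
# re-stated; a business change or correction arrives as a NEW row.

from datetime import datetime, timezone

edw_history = []  # append-only; stands in for an EDW table

def load_raw(source_rows, record_source):
    load_dts = datetime.now(timezone.utc)
    for row in source_rows:
        # no transformation, no cleansing: the data lands exactly as sourced
        edw_history.append({**row, "load_dts": load_dts,
                            "record_source": record_source})

load_raw([{"invoice_id": 7, "total": 105.37}], record_source="BILLING")
load_raw([{"invoice_id": 7, "total": 105.40}], record_source="BILLING")  # later correction

assert len(edw_history) == 2  # both versions preserved for audit
```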
If you are re-stating, re-rolling, or changing historical data to represent business changes, then you may be experiencing the pains of growth: high costs, huge impacts from business change, and basically a "melt-down" of the federated star schema approach (as an EDW only, not as a data mart delivery solution). If changing a conformed dimension to meet business needs carries a HUGE impact list and a high cost, then you have issues around business logic "up-stream" that can be resolved by changing the paradigm.
BUT you need a good data model architecture in order to implement the paradigm I'm discussing above. You can use a star-schema data model (IF there are hundreds of small dimensions, IF the fact tables are at the lowest level of grain, IF the data is NOT altered on the way in, and IF the dimensions are not conformed). However, there is a data modeling architecture (recently endorsed by Bill Inmon) that I've spent 10 years designing (based on mathematical principles) and the last couple of years releasing. The modeling architecture is called the "Data Vault Data Modeling Architecture," and Bill has stated that it is the "optimal choice for modeling the DW2.0."
The architecture in and of itself helps to enforce the paradigm shift I'm discussing, but the approach surrounding the implementation of the architecture is where the magic lies. The Data Vault architecture is freely available, just like 3rd normal form and star-schema. It is quite simple, and it returns to business modeling, specifically business keys, for its foundational build-out. But I digress.
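For readers who haven't seen the Data Vault before, here's a minimal sketch of its three core structures, all keyed on business keys; the entity and field names are illustrative only, not a normative implementation:

```python
# A sketch of the core Data Vault structures: hubs hold unique business keys,
# links hold relationships between hubs, and satellites hold descriptive
# attributes historized by load date. All three are insert-only.

from dataclasses import dataclass
from datetime import datetime

@dataclass
class HubCustomer:          # one row per unique business key
    customer_bk: str        # the business key the model is built around
    load_dts: datetime
    record_source: str

@dataclass
class LinkCustomerOrder:    # a relationship between two hubs
    customer_bk: str
    order_bk: str
    load_dts: datetime
    record_source: str

@dataclass
class SatCustomerDetails:   # descriptive attributes, historized by load_dts
    customer_bk: str
    load_dts: datetime
    record_source: str
    name: str
    credit_limit: float

hub = HubCustomer("C-17", datetime(2007, 8, 31), "CRM")
print(hub)
```

Because satellites are append-only and hang off stable business keys, a business change adds new satellite rows rather than forcing a re-statement of history, which is exactly the audit property described above.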
My points are: a) the business needs to be shown "what-is" versus "what-they-think-is," a gap analysis if you will; b) the data in the EDW needs to be accountable, auditable, at the lowest level of grain, and never "re-stated," "re-rolled," or otherwise altered; c) I.T. and the business run a huge risk when "cleansing and quality" are executed on the way IN to the EDW instead of on the way OUT.
To that end, data quality processing is full of interpretation. Even if a "quality engine" is used, it can change, assimilate, or aggregate data in ways that are incorrect. If there's no base data, no raw data, to check it against, we can lose a compliance audit (and often do). When I.T. implements a process "on the way in" to the EDW that changes data, they are subject to the business stating: "what you / your algorithms produced is wrong, now change it." Then business and I.T. go to war, because the business can't agree on how to define it and the requirements continually change. We (I.T.) are left trying to fit a round peg into a square hole, just to make it fit, or to meet the needs of the many.
When a compliant and scalable EDW architecture is built (regardless of the data model), and the star schemas for delivery are NOT federated together for the enterprise, then single stars can do what they do REALLY WELL: provide subject-oriented answers to specific business user groups, rapidly. This means we can source the EDW many times for different stars, execute the business rules that put the data into a specific context, and load individual star schemas for different uses.
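A minimal sketch of that multi-pass sourcing, with hypothetical rule sets for two user groups; the point is that each star gets its own context while the raw EDW rows are never altered:

```python
# A sketch of sourcing the same raw EDW data several times, applying a
# different business-rule set on each pass to load separate, non-federated
# star schemas for different user groups.

raw_sales = [
    {"region": "EU", "amount": 100.0, "returned": True},
    {"region": "US", "amount": 250.0, "returned": False},
]

def finance_rules(rows):
    # finance counts returns as negative revenue
    return [{**r, "amount": -r["amount"] if r["returned"] else r["amount"]}
            for r in rows]

def marketing_rules(rows):
    # marketing excludes returned sales entirely
    return [r for r in rows if not r["returned"]]

finance_star = finance_rules(raw_sales)      # one star, one context
marketing_star = marketing_rules(raw_sales)  # another star, another context
# raw_sales itself is never modified by either pass
```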
One more note about data quality / information quality: if it is done, it should be done AT THE SOURCE SYSTEM LEVEL, where the system of record exists. This way, the original and "cleansed" data can be audited at the source and fed into the raw warehouse from the source. These "cleansed" values should be governed by a governance process and a data management specialist. On the flip side, quality can be executed when loading data from the EDW to the star schema, and what I recommend there is that I.T. build "error marts" for broken data, and move the data that doesn't meet "today's version of the truth" into them (note: my definition of a mart is any data set that is customer- or user-facing; I am not implying any particular modeling method by the word MART).
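One way to picture source-level cleansing with an audit trail, sketched with hypothetical fields:

```python
# A sketch of cleansing at the source system: the original value is kept
# alongside the cleansed one, so both can be audited at the system of record,
# and the original (raw) value is what feeds the warehouse.

def cleanse_at_source(record):
    cleansed = record["phone"].replace("-", "").replace(" ", "")
    return {
        "customer_id": record["customer_id"],
        "phone_original": record["phone"],  # auditable system-of-record value
        "phone_cleansed": cleansed,         # governed, cleansed value
    }

print(cleanse_at_source({"customer_id": 9, "phone": "555 123-4567"}))
```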
Hope this helps,
PS: Love to hear your thoughts, comments and experiences.
Posted August 31, 2007 6:40 AM