This is a long overdue entry from way back when. I've recently received some good questions on my original entry for this discussion on a data modeling technique that sits squarely between Inmon and Kimball, or should I say, squarely between 3NF and star schema. This technique is called the Data Vault Data Modeling architecture. In this entry we'll dive a little deeper and answer some of the questions posted on the last entry.
Are there healthcare and/or insurance companies that have successfully implemented an enterprise or patient care data vault? (Question from a reader).
Yes. Blue-Cross Blue-Shield (Wellpoint in Saint Louis, MO) is just such a company, and they have (last I'd heard) 18TB or more in their Data Vault on an IBM DB2 UDB platform, using Informatica as an ETL tool, and IBM Websphere for EAI purposes.
What are some of the Best Practices around a hybrid hub-and-spoke approach (i.e. data vault with supporting data marts)?
Please keep in mind that standard enterprise data integration / data warehousing best practices always apply, no matter which data modeling technique you're choosing.
Some of the best practices include the following:
* Use views for virtual data marts (to produce dimensions and facts virtually) until performance becomes an issue. When performance becomes an issue, then "instantiate" denormalized tables underneath to help with speed of queries.
* Produce an ERROR MART for poor quality data, or data that breaks business rules, move your "bad" data to the error mart, and not into the accountable marts. This helps increase business accountability, increase visibility of bad-data (and ultimately the business' desire to fix it).
* If you're going to use Data Quality engines to "clan and alter the data", then ensure you have audit trails somewhere, either from the tool itself, or better yet - kept inside the Data Vault by using Link tables with Satellites to answer these questions. This will help with the auditability and traceability (compliance) of the data sets within the Data Vault structures.
* Remember the Data Vault is a true data warehouse, that supports both active and batch feeds at the same time, and that it is very good for data mining (inside the database), and is not so good (due to high numbers of joins) at data delivery. It is not for the average user to "sit down and query." the average user should go through managed views, or a data mart should be built for those types of questions. Power Users and trained business users looking for specific anomalies can use the Data Vault in all it's glory to achieve some extremely high-end results.
* Denormalize using Materialized views, or Point-in-Time or Bridge tables which are defined on my web-site, but not in my articles (to any great degree).
* Don't break the architecture - one of the strengths of the Data Vault is it's flexibility and scalability over time. It's ability to continue to meet the business needs without huge impacts when business changes need to be made to the model. As with any modeling technique - if the formalized rules are broken, then the model deviates into "never-never land" and begins to take on characteristics of a ball of band-aids, which eventually is left out in the rain and melts away. It eventually stops meeting business needs, and increases impacts to the point where business is constrained and "can't afford" the impact that their change requires.
Is there a matrix for when do you which data warehouse modeling technique? I know there is a lot around Inmon vs Kimball but nothing which includes data vault. What business questions need to be answered to determine the technique?
Business / Technical questions that will lead you to the Data Vault:
* Is your current EDW so large that volume and performance are binding up the SQL queries?
* Is your EDW currently batch-driven, and star-schema based, such that going to a real-time enterprise is proving to be a real hassle?
* Is your EDW matured, and business is saying "no - we can't afford the time/cost to make the change to our star-schema based EDW"?
* Does the data model of your EDW look like a patch-work quilt? In other words, does it follow all the rules of good clean data modeling, or have "exceptions" to the rules been made to accommodate business changes in the past, and now - making changes results in a huge list of impacts / time / or cost?
* Is the volume of data in your EDW or the number of tables currently made the EDW model inflexible to change?
* Does your model contain Junk, and Helper tables? why? What is the helper table "helping?" Is it helping the architecture, or is it helping the data set? If it's truly Junk - why is it in the EDW model to begin with? According to good data architecture practices, nothing should be in the data models without a defined purpose.
* Is your business struggling with compliance, auditability, and accountability within the EDW?
* Does your business have a need to recovery from a disaster quickly? If so, how many times has your EDW data model changed over the past year - which leads to: can the data from a backup truly be restored to a parent or child table?
* Does your data model suffer from Dimensionitis? Too many dimensions created for too many purposes, leading back to stove-piped solutions, rather than something that resembles and EDW / ADW.
* Does your data model contain cascading change impacts when parent tables need to be inserted at the top of the change? (This is a result of modified 3NF style data warehousing modeling where the surrogate and the time-based key are part of the Primary Key)
* Does your ADW/EDW model support splits of Type of data and Rate Of Change?
* Does your Dimension have Dementia or a split personality? In other words, how many of the attributes within the dimension REALLY represent or describe the business key at the top of the table? How many attributes are tacked on to the dimension just because they were "joined" somewhere upstream, and they look like they belong there?
* Has the level of volume reached a point where nested-sub queries begin to be problematic across your Star Schema, especially when you're examining a wide slice of customers with all their history?
There are quite a few more, this is just a start. If you have some of these issues, and wish to share anonymously, that would be wonderful. I'm looking for alternative and opposing views as well.
Feel free to ask more questions, I'll try to post the answers.
Posted October 18, 2006 6:36 AM
Permalink | 1 Comment |