Blog: Dan E. Linstedt

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, unstructured data, DW and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University, where I participate on an academic advisory board for Masters students in I.T. I can't wait to hear from you in the comments of my blog entries. Thank you, and all the best, Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor of The Data Warehousing Institute, a featured speaker at industry events, and a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI / CMMi Level 5, and is the inventor of the Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, trained hundreds of industry professionals, and is the voice of Bill Inmon's blog at http://www.b-eye-network.com/blogs/linstedt/.

How does your I.T. department respond when you (the business) need to make a change? Do they come back to you with a long list of impacts, a long time frame to implement, a high cost, or all of the above? Are you (the business) so frustrated that you build your own solutions in Excel, MS-Cubes, or something else? This entry examines the problem from a business point of view, with a little bit of technical speak.

In this entry we'll dive a little deeper and try to uncover at least one part of the problem that impedes flexible change and incurs huge costs. We'll also suggest how it might be fixed - and by the way, if you are moving toward real-time, huge volumes of data, or both, then this issue begs for a solution.

I'm not suggesting that this is the only solution; I'm just suggesting it is one way to handle the problem. But what is the problem, you ask? Why do we have such long lead times, huge costs, and big impacts to our existing systems? Why does the problem repeat itself over and over, whether we "buy" an off-the-shelf package or build it ourselves?

Well, let me take you back for a minute to something called data modeling. Data modeling is the art (or science) of placing your information into an architecture that shows relationships and containers for what we call "structured data." Structured data simply means we've defined "fields" (like column headers in Excel) that hold specific, pre-defined TYPES of data, grouped together. When we group these fields into "tables," we are stating that they are similar or related in some way - it's like separating finance from sales onto two tabs in Excel, or keeping line-item revenue on one tab and a list of products or customers on another.

So fields and tables are inter-related?
Yes, on a very basic level. They are "associated" information that lives together; this is what forms a "row" of information in a database, or in an Excel spreadsheet.
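To make the idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and field names are hypothetical examples, not from any real system; the point is simply that a "table" groups typed "fields," and each entry forms a "row," just like a header row plus data rows in Excel.

```python
import sqlite3

# A "table" groups related, typed "fields"; each entry is a "row."
# Table and column names here are illustrative examples only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        customer_id   INTEGER PRIMARY KEY,  -- a pre-defined, typed field
        customer_name TEXT NOT NULL,
        signup_date   TEXT
    )
""")
conn.execute(
    "INSERT INTO customer (customer_id, customer_name, signup_date) VALUES (?, ?, ?)",
    (1, "Jane Doe", "2007-09-05"),
)
row = conn.execute("SELECT customer_id, customer_name FROM customer").fetchone()
print(row)  # (1, 'Jane Doe')
```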

With you so far, so where's the problem?
Ok, we're getting to that. First, we need to make a statement like this: in our business, a customer may have many portfolios, but a portfolio may have only one customer. Furthermore: each portfolio may have many accounts, but each account can be in only one portfolio.

What we've just defined is a hierarchy - a dependency of information, i.e., business rules for how this data is supposed to interact and behave in association with other data. We've defined "dependencies" across our information.

The entity-relationship diagram (ER diagram) - or, better yet, the relationships across "tabs" in my Excel spreadsheet - may look as follows:
ERDiagramSimpleModel.jpg
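As a sketch, the hierarchy above could be expressed as foreign keys: a portfolio belongs to exactly one customer, and an account belongs to exactly one portfolio. The names below are illustrative, but they show how those parent-child dependencies get "locked in" by the model itself.

```python
import sqlite3

# The hierarchy as foreign keys: each child row must point at exactly one parent.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce FKs
conn.executescript("""
    CREATE TABLE customer  (customer_id  INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE portfolio (portfolio_id INTEGER PRIMARY KEY,
                            customer_id  INTEGER NOT NULL REFERENCES customer);
    CREATE TABLE account   (account_id   INTEGER PRIMARY KEY,
                            portfolio_id INTEGER NOT NULL REFERENCES portfolio);
    INSERT INTO customer  VALUES (1, 'Jane Doe');
    INSERT INTO portfolio VALUES (10, 1);
    INSERT INTO account   VALUES (100, 10);
""")
# The database itself now refuses data that breaks the business rule:
try:
    conn.execute("INSERT INTO account VALUES (101, 99)")  # portfolio 99 doesn't exist
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

That rejection is exactly the referential integrity this entry is about - the business rule is baked into the data model.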

What we've done AS A BUSINESS is stated:
"This is how we do business today." Or better yet, "This is how we think the business should be run today."

Why is this a problem? Where is the problem?
Well, if the business has a simple model (like this one), it's really no issue. The business can grow and change, and re-define the model (make changes to the relationships). The problem comes in two ways:
1) Volume and latency change everything: volume drives the size of the model, and latency determines how fast the business needs the model to change.
2) Parent-child relationships change. Dependencies change over time.

Remind me again, why is this a problem? We've been doing it this way for years, and...
Well, before you finish your thought, let me say this: how many times does a "new source system" or "new application vendor" on the block come around and say, "We've solved all your issues for CRM/ERP/SCM, etc. - all you need is our system, then you can sunset your existing system(s) and lower TCO!"?

They come up with new ways of representing the same old problem: parent-child relationships. They build new data models and pile new applications on top of them with wonderful new functionality, but eventually (sigh) the business must change, and voila - I.T. and the business are back at it again, trying to figure out how to build change into these systems.

Ok, enough discussing it, what is the problem?
PARENT-CHILD RELATIONSHIPS - dependencies across tables. In our example, that means how customers relate to portfolios, and how portfolios relate to accounts. Suppose your business moves on and, a year later, decides: you know, a portfolio really can be attached to more than one customer at the same time (for instance, husband and wife are two separate customers in the list).

Now you've done it! You've changed the business, and you've also issued the change decree to our tiny little source system. Let's take a look at the new data model:

ERDiagramSimpleModelChanged.jpg

The diagram on the left is the original model, the diagram on the right is the new model.
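A sketch of the new model, again with illustrative names: the direct customer_id column on portfolio is replaced by a many-to-many "relationship table," so a portfolio can now be shared by two customers (the husband-and-wife case) without any further structural change.

```python
import sqlite3

# The changed model: the customer-portfolio dependency moves into a
# many-to-many relationship table. Names are illustrative examples.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer  (customer_id  INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE portfolio (portfolio_id INTEGER PRIMARY KEY);
    CREATE TABLE customer_portfolio (          -- the many-to-many link
        customer_id  INTEGER NOT NULL,
        portfolio_id INTEGER NOT NULL,
        PRIMARY KEY (customer_id, portfolio_id)
    );
    INSERT INTO customer VALUES (1, 'Husband'), (2, 'Wife');
    INSERT INTO portfolio VALUES (10);
    -- Both customers attached to the same portfolio: legal in the new
    -- model, impossible in the old one.
    INSERT INTO customer_portfolio VALUES (1, 10), (2, 10);
""")
owners = conn.execute(
    "SELECT COUNT(*) FROM customer_portfolio WHERE portfolio_id = 10"
).fetchone()[0]
print(owners)  # 2
```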

Now you say: that change was easy! There's no reason why it should cause such heartburn in my business... Why the impact? Why the issues? Why the high cost?
Hold your horses, Nellie...

I'll tell you. Here are the problems with the change to the data model:
1) Any "backups" you may have been making for the past year or two probably didn't include the original data model. A smart database administrator would have included a backup of the data model, so that restoration of the original data is possible; otherwise, you'd be out of compliance in your source system.
2) The historical data doesn't meet this new rule; however, it can be altered to fit the new structure going forward. Why? Because this structure is more flexible and forgiving than the old one. Placing a "relationship table" - what we call a many-to-many - in the design increases the flexibility to represent _any_ relationship that comes forward in time. However, there have been cases where we force the data model to represent "business referential integrity" - in other words, this account cannot exist without being attached to an existing portfolio.

Even this fails over time: the relationship is sometimes shut off or disabled, or goes bad (can never be enforced in the source system) for a variety of reasons.

3) The more parent-child relationships we have in our model (like the one still left, between accounts and portfolios), the more inflexible our models are, the higher the cost of business change, and the larger the impact. Why? Well, suppose there are 25 child tables (25 pieces of information dependent on account). When account changes, all of those dependencies are impacted - all the loading routines, validation routines, data access routines, and so on. This is why, when you build a "model" of data in Excel, you typically build it very shallow - one, two, or maybe three layers deep - because you have to change it quickly.

Ok, I'm kind of seeing this... I think I partially understand. So how do I fix it once and for all?
It requires a new kind of thinking: the ability to express the data and the relationships independently of time and of business changes over time, and the ability to move the job of "referential integrity checks" back into the business logic, back into the _application_ layers.

As long as parent-child referential integrity exists within _any_ data model, we (the business) will continue to suffer huge and growing change impacts, high costs for business changes, and inflexibility. We need to shift the way we treat referential integrity (the rule that, in order for an account to exist, it must be tied to a portfolio that exists).

We need to put many-to-many relationships in 100% of our data model. We need to express the fact that a portfolio (last year) could have one and only one customer but (this year) could be tied to two or more customers. We need to express (in the model) that an account could belong to one and only one portfolio (last year) and, this year, could be tied to more than one portfolio.

The newest model should appear as follows:
ERDiagramFullModel.jpg
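A sketch of this "100% many-to-many" idea, with illustrative names: every parent-child dependency becomes a relationship table, so last year's rule (one customer per portfolio, one portfolio per account) and this year's rule (shared portfolios and accounts) live side by side with no structural change.

```python
import sqlite3

# Every relationship is a many-to-many link table; the one-to-many rule
# of last year is just a special case of the data, not of the structure.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer  (customer_id  INTEGER PRIMARY KEY);
    CREATE TABLE portfolio (portfolio_id INTEGER PRIMARY KEY);
    CREATE TABLE account   (account_id   INTEGER PRIMARY KEY);
    CREATE TABLE customer_portfolio (customer_id INT, portfolio_id INT,
                                     PRIMARY KEY (customer_id, portfolio_id));
    CREATE TABLE portfolio_account  (portfolio_id INT, account_id INT,
                                     PRIMARY KEY (portfolio_id, account_id));
    INSERT INTO customer  VALUES (1), (2);
    INSERT INTO portfolio VALUES (10), (11);
    INSERT INTO account   VALUES (100);
    -- Last year's data (one-to-one ownership) fits:
    INSERT INTO customer_portfolio VALUES (1, 10);
    INSERT INTO portfolio_account  VALUES (10, 100);
    -- This year's data (shared ownership) also fits, additively:
    INSERT INTO customer_portfolio VALUES (2, 10);
    INSERT INTO portfolio_account  VALUES (11, 100);
""")
shared = conn.execute("SELECT COUNT(*) FROM customer_portfolio").fetchone()[0]
print(shared)  # 2
```

Notice that absorbing the business change required only new rows, never an ALTER TABLE - that is the "additive" property described below.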

It isn't just for data warehousing; it's for operational systems too! Of course, the arguments against it are as follows:
1) My data no longer represents the Ref. Integrity of the business, this is blasphemy!
2) I can't live like this... (why, I don't know)
3) This doubles the number of tables I have to manage!
4) This is ridiculous; I now have to change my application... (to do what?)

Here are the reasons why you'd want it this way:
1) Isn't referential integrity merely a statement of how the business does business today? Isn't that subject to change when the business changes? Why, then, should it be locked down in a data model?
2) Can't help you if you don't believe it, I guess you'll have to continue to suffer the consequences of high cost, huge change impacts, and inflexibility.
3) Yes, and... what impact does "doubling the tables" have if it brings you high levels of flexibility? Better yet, the mathematics show that over time you change less and less of the existing model: new relationships and new parts of the model become ADDITIVE.
4) Yes, and shouldn't your business applications be the ones with the business rules? Aren't they the ones with all the rest of the business rules? Why should a data representation change when the business changes? Why shouldn't _all_ the rules, including referential integrity be defined in the application (one step closer to the business)?

What if the application is a data warehouse application?
Great! You've just increased the flexibility to handle change in your data warehouse, absorb new systems, split off old ones, and so on.

Ok, I kind-of buy it, but what about Referential Integrity, when should that be enforced?
In two places:
1) When the data is captured within the application capturing it - it would clean up a LOT of these source systems, and put much more rigorous business logic (and cleaner data) into the source systems to begin with.
2) When the data is "released" for users, to reports, to screens, to output. This is when reusable common services / routines for getting data out are helpful. They implement the referential integrity in the application layer.
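A sketch of what enforcing referential integrity in the application layer might look like, under the two-place rule above. The function and table names are hypothetical; the point is that one check runs at capture time and one at release time, with no foreign key constraint in the model itself.

```python
import sqlite3

def capture_account(conn, account_id, portfolio_id):
    """Refuse bad data at the point of capture - the rule lives in the app."""
    exists = conn.execute(
        "SELECT 1 FROM portfolio WHERE portfolio_id = ?", (portfolio_id,)
    ).fetchone()
    if not exists:
        raise ValueError(f"portfolio {portfolio_id} does not exist")
    conn.execute("INSERT INTO account VALUES (?, ?)", (account_id, portfolio_id))

def release_accounts(conn):
    """On the way out, only release accounts whose portfolio still exists."""
    return conn.execute("""
        SELECT a.account_id FROM account a
        JOIN portfolio p ON p.portfolio_id = a.portfolio_id
    """).fetchall()

# No REFERENCES clause anywhere - the model stays flexible.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE portfolio (portfolio_id INTEGER PRIMARY KEY);
    CREATE TABLE account   (account_id INTEGER PRIMARY KEY, portfolio_id INT);
    INSERT INTO portfolio VALUES (10);
""")
capture_account(conn, 100, 10)       # passes the capture-time check
try:
    capture_account(conn, 101, 99)   # rejected: no such portfolio
except ValueError as e:
    print("rejected:", e)
print(release_accounts(conn))  # [(100,)]
```

Because both checks live in reusable routines rather than in the schema, they can be relaxed, tightened, or versioned when the business rules change - without touching the data model.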

WHY? WHY? WHY? I still don't understand why you'd want Ref. Integrity in the application layer...
Well, look at it this way: if historical data "breaks" the current model, and you don't have the "old model" around to re-load it, then the historical data can no longer be accessed by auditors. (Bill Inmon talks about this in DW2.0 and in handling metadata; of course, Bill has been talking about this since 1990.)

Another example: if the application is "broken" and is loading data into the model that doesn't meet the business requirements of today, wouldn't you want to know where that broken piece is? You should - it's costing you (the business) money! If you can't capture the broken data, then the broken business process may _never_ get fixed. We have to capture it to expose it, expose it to know that it's there, and know that it's there to decide what to do about it.

What happens when I need to add a new piece of information to my business? Or merge a new system in? What does my model look like and how fast / easy is it to change?
Let's say, that in this example a new system had services, and the accounts are now to be rolled up into services, which are then owned by portfolios.

The original model (with the new system change) would look like this:
ERDiagramOriginalChange.jpg

The impact doesn't look too large until you begin to consider how many other tables (owned by account) are impacted by the change near the top of the model - not to mention that the new and old data sets no longer jibe, and that loading "historical" data into the operational system for an audit requires the old data model to be re-built as well. Integrating these sets of information is nearly impossible.

The new model (with using many-to-many relationships) would look something like this:
ERDiagramChangeRel.jpg

In this model, new data can be represented going forward while old information still exists in the system of record (the source system); this provides flexibility. In both cases, the impacts are highlighted with a green background.

I've developed a formalized data modeling architecture that is proving itself around the world as the mechanism for building these solutions. It's called Data Vault Data Modeling (Common Foundational Integration Modeling), and it's being used by the Tax Service in the Netherlands, the US Government, SNS Bank, WellPoint (Blue Cross Blue Shield), DaimlerChrysler, and a number of other institutions around the world.

I hope this was helpful in explaining why there are large impacts and high costs to making changes in the business. I'd love to hear your thoughts, ideas, and comments.
Dan Linstedt
You can find out more about the Data Vault modeling at: http://www.DanLinstedt.com


Posted September 5, 2007 5:30 AM
