Blog: Dan E. Linstedt« Automated Derterministic Contextual Coefficients | Main | Commoditization of the EDW/BI Market. » How Data Models can Impact BusinessHow does your I.T. department respond when you (the business) need to make a change? Do they come back to you with a long list of impacts, a long time frame to implement, a high cost, or all of the above? Are you (the business) so frustrated that you build your own solutions in Excel or MS-Cubes, or something else? The focus of this entry examines the problem from a business point of view, with a little bit of technical speak. In this entry we'll dive a little deeper and try to discover at least one part of the problem that is impeding flexible change and incurring huge costs. We'll also make a suggestion as to how this might be fixed, and by the way - if you are moving towards real-time, or huge volumes of data or both, then this issue begs for a solution. I'm not suggesting that this is the only solution; I'm just suggesting this is one way to handle it. But what's the problem you ask, why do we have such long lead times, huge costs, and big impacts to our existing systems? Why does the problem repeat itself over and over again, whether we "buy" an off the shelf package, or build it? Well, let me take you back for a minute to something called Data Modeling. Data Modeling is the art or science of placing your information in to an architecture that is supposed to show relationships, and containers for what we call "structured data." Structured data simply means we've defined "fields" (like you define column headers in Excel) in which we usually hold specific TYPES of data that is pre-defined, and grouped. When we group these fields in to "tables" we are stating that they are similar in some way, or related in some way, it's like separating "finance from sales" on two tabs in Excel, or it's like defining two Excel tabs, line-item revenue from a list of products or customers on another tab. So fields and tables are inter-related? With you so far, so where's the problem? What we've just defined is a hierarchy, a dependency of information, i.e. business rules for how this data is supposed to interact or behave in association with each other. We've defined "dependencies" across our information. The Entity - Relationship Diagram (ER Diagram), or better yet, relationships across "tabs" in my Excel Spreadsheet may look as follows: What we've done AS A BUSINESS is stated: Why is this a problem? Where is the problem? Remind me again, why this is a problem? We’ve been doing it this way for years, and... They come up with new ways of representing the same old problem, parent-child relationships. They build new data models, and pile new applications on top of it with wonderful new functionality, but eventually (sigh) the business must change, and voila - I.T. and the business are back at it again, trying to figure out how to build change into these systems. Ok, enough discussing it, what is the problem? Now, you've done it! You've changed the business, you've also issued the change decree to our tiny little source system. Let's take a look at the new data model:
The diagram on the left is the original model, the diagram on the right is the new model. Now you say, that change was easy! There's no reason why it should cause such heartburn in my business... Why the impact? Why the issues? Why the high cost? I'll tell you. Here are the problems with the change to the data model: Even this fails over time, the relationship is sometimes shut-off / disabled, or goes bad (can never be enforced in the source system) for a variety of reasons. 3) the more parent-child relationships we have in our model (like the one still left, accounts and portfolios), the more inflexible our models are, the higher the cost of business change, and the larger the impact. Why? Why? Why? Well, suppose there are 25 child tables (25 pieces of information dependent on account), when account changes, the dependencies (all of them) are impacted, all the loading routines, validation routines, data access routines, and so on. This is why when you build a "model" of data in Excel, you typically build it very shallow: one, two or maybe three layers thick, because you have to change it quickly. Ok, I'm kind of seeing this... I think I partially understand, so how do I fix it once and for-all? As long as parent-child referential integrity exists within _any_ data model, we (the business) will continue to suffer huge and growing change impacts, and high costs to business changes, and inflexibility. We need to shift the way we treat the referential integrity (which means, that in order for an account to exist, it must be tied to a portfolio that exists). We need to put many to many relationships in 100% of our data model, we need to express the fact that portfolios (last year) could have one and only one customer, but this year, could be tied to two or more customers. We need to express (in the model) that accounts could be for one and only one portfolio (last year), and this year, could be tied to more than one portfolio. The newest model should appear as follows: It isn't just for data warehousing, it's for Operational Systems too! Of course the arguments against it are as follows: Here are the reasons why you'd want it this way: What if the application is a data warehouse application? Ok, I kind-of buy it, but what about Referential Integrity, when should that be enforced? WHY? WHY? WHY? I still don't understand why you'd want Ref. Integrity in the application layer... Another example is: if the application is "broken" and is loading data to the model that doesn't meet the business requirements of today, wouldn't you want to know where that broken piece is? You should, it's costing you (the business) money!! If you can't capture the broken data, then the broken business process may _never_ get fixed. We have to capture it to expose it, and expose it to know that it's there, and know that it's there to decide what to do about it. What happens when I need to add a new piece of information to my business? Or merge a new system in? What does my model look like and how fast / easy is it to change? The original model (with the new system change) would look like this: The impact doesn't look too large, until you begin to consider how many other tables (owned by account) are impacted by the change in near the top of the model, not to mention the new & old data sets no longer jive - and to load "historical" data in to the operational system for an audit requires the old data model to be re-built as well. Integrating these sets of information is near impossible. The new model (with using many-to-many relationships) would look something like this: In this model, new data can be represented going forward, while old information still exists in the system of record (source system), this provides flexibility. In both cases, the impacts are highlighted with a green background. I've developed a formalized data modeling architecture that is proving itself around the world as the mechanism by which to build these solutions, it's called: Data Vault Data Modeling (Common Foundational Integration Modeling) and it's being used by Secure Symbology Incorporated, Tax Service in the Netherlands, US Government, SNS Bank, (WellPoint Health Care) Blue-Cross-Blue Shield, Fresh-Direct, and a number of other institutions around the world. I hope this was helpful, as to explaining why there are large impacts, and high costs to making changes in business. I'd love to hear from you, thoughts, ideas, and comments.. |