Blog: Dan E. Linstedt Subscribe to this blog's RSS feed!

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, Unstructured Data, DW and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University where I participate on an academic advisory board for Masters Students in I.T. I can't wait to hear from you in the comments of my blog entries. Thank-you, and all the best; Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author >

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor of The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including: IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata.  He is trained in SEI / CMMi Level 5, and is the inventor of The Matrix Methodology, and the Data Vault Data modeling architecture. He has built expert training courses, and trained hundreds of industry professionals, and is the voice of Bill Inmons' Blog on http://www.b-eye-network.com/blogs/linstedt/.

September 2007 Archives

Like all markets, the EDW and BI market is moving (if it hasn't already) towards commoditization in implementation. Many would argue (even today) that this is "impossible" - as I've heard it, they still believe that every DW / BI solution must be custom driven, must be custom loaded, must be custom created. Why? Because we have "different requirements/different data/high volumes/real-time" and so on. We've heard it all before, it was the calm before the storm when ERP vendors rose up and said: You CAN have a standard, and the standard is "our solution." It happened with Source Systems, and now it will happen with EDW/BI solutions.

Wait just a minute, you mean to say that we are or will all be building EDW / BI solutions the same way every time?
Yes. It's the nature of any industry: first, we figure out how to do it, then we all develop "specialized ways" of doing it, then someone begins to use standards to build the solutions, then someone else says: with standards, we can repeat it. Then we "hopefully" optimize the standard process for building it.

Finally, someone says: here's a solution wrapped up for you that is optimized, and handles your line of business according to the standards.

All we have to do is look around at the industry.

1) We have a standard over-arching framework for I.T.: Zachman Framework.
2) We have a standard frame-work for EDW/BI: DW2.0, CIF, GIF
3) We have a standard logical data model: Universal Data Models
4) We have a standard data modeling paradigm: Data Vault Modeling Architecture
5) We have repeatable, consistent, redundant, and reliable business mechanisms to get things done: ITIL, CoBIT, SEI/CMMI principles and so on.
6) We have standard local and global financial principles to account for: G.A.A.P, and SARBOX, and other compliance initiatives

Now, do we have the tools or the pre-packaged solutions available?
Well, we have some vendors in this space that are starting to bring this to the table, we have others that have _highly_ specialized solutions to solve specific problems (not standard enough to be wide-spread).

Well, some would say vendors tried this in the past with "Analytic Applications", aren't we returning to that path?
No, analytic applications from a vendor perspective cost too much to customize, didn't serve all the data from all the source systems that existed, couldn't be repeatable, redundant, or fault-tolerant (once modified out of the box, they lost their value - the vendors couldn't upgrade them without destroying the customizations).

In order for commoditization to happen, someone needs to wrap business functions with EDW with BI and meet all the above standards, while providing a flexible, sustainable architecture under the covers. That someone also needs to provide a full-solution (very large, like SAP did when they first entered the market place), and it needs to be a HORIZONTALLY posititioned solution rather than a vertical stove-piped solution.

Will it happen?
You bet, it's happening now (still on a manual basis) but there are those of us in the industry who can / and have built standard solutions in a commoditized fashion for several different companies.

So what, what's the big benefit of commoditization?
Well, since you asked: LOWER COST. It's incredible how much lower the cost can be when the EDW/BI solution is repeatable, consistent, fault-tolerant, and reliable. An entire EDW that shows the business results from a full source system can be built with BI in 1 to two months now. Most people think of 90-day time-box deliverables for a single scoped portion of the business, but I see a trend, downward to 1 month deliverables - and with less people to build it as well.

Commoditization will happen within the next year, mark my words - no literally, book-mark my words... (just kidding). It will happen, and the first movers will gain much of the competitive advantage.

Typical signs of Commoditization:
1. fully-configured appliance, installed and running in record time.
2. 1 consultant on-site instead of a team of 5 or 10 people
3. Full source system integrated within 2 to 3 weeks, and business reports within 1 to 2 additional weeks (provided the business turns around their requirements that fast).
4. Generated load routines
5. Consistent, repeatable, measurable business process for EDW/BI
6. Optimizable business process
7. Greatly reduced risk to build out
8. Incredible cost savings (instead of $250k for a prototype, cost is $50k to $75k)
9. I.T. becomes extremely nimble and adaptable to change in the EDW / BI environment, extreme agility.

Do you have a project that was done like this? I'd love to hear from you.
Dan Linstedt


Posted September 6, 2007 5:40 AM
Permalink | No Comments |

How does your I.T. department respond when you (the business) need to make a change? Do they come back to you with a long list of impacts, a long time frame to implement, a high cost, or all of the above? Are you (the business) so frustrated that you build your own solutions in Excel or MS-Cubes, or something else? The focus of this entry examines the problem from a business point of view, with a little bit of technical speak.

In this entry we'll dive a little deeper and try to discover at least one part of the problem that is impeding flexible change and incurring huge costs. We'll also make a suggestion as to how this might be fixed, and by the way - if you are moving towards real-time, or huge volumes of data or both, then this issue begs for a solution.

I'm not suggesting that this is the only solution; I'm just suggesting this is one way to handle it. But what's the problem you ask, why do we have such long lead times, huge costs, and big impacts to our existing systems? Why does the problem repeat itself over and over again, whether we "buy" an off the shelf package, or build it?

Well, let me take you back for a minute to something called Data Modeling. Data Modeling is the art or science of placing your information in to an architecture that is supposed to show relationships, and containers for what we call "structured data." Structured data simply means we've defined "fields" (like you define column headers in Excel) in which we usually hold specific TYPES of data that is pre-defined, and grouped. When we group these fields in to "tables" we are stating that they are similar in some way, or related in some way, it's like separating "finance from sales" on two tabs in Excel, or it's like defining two Excel tabs, line-item revenue from a list of products or customers on another tab.

So fields and tables are inter-related?
Yes, on a very basic level, they are more like "associated" information that lives together; this is what forms a "row" of information in a database, or in an Excel spreadsheet.

With you so far, so where's the problem?
Ok, we're getting to that. First, we need to make a statement like this: in our business, a Customer may have many portfolios, but a portfolio may have only one customer, furthermore we say: each portfolio may have many accounts, but each account can be in only one portfolio.

What we've just defined is a hierarchy, a dependency of information, i.e. business rules for how this data is supposed to interact or behave in association with each other. We've defined "dependencies" across our information.

The Entity - Relationship Diagram (ER Diagram), or better yet, relationships across "tabs" in my Excel Spreadsheet may look as follows:
ERDiagramSimpleModel.jpg

What we've done AS A BUSINESS is stated:
"This is how we do business today." Or better yet, "This is how we think the business should be run today."

Why is this a problem? Where is the problem?
Well, if the business has a simple model (like this one), it's really no issue. The business can grow and change, and re-define the model (make changes to the relationships). The problem comes in two ways:
1) volume & latency change everything, volume = size of model, latency = how fast does the business need to change?
2) Parent and Child relationships change. Dependencies change over time.

Remind me again, why this is a problem? We’ve been doing it this way for years, and...
Well, before you finish your thought - let me say this: how many times does a "new source system, new application vendor" on the block come around and say "We've solved all your issues for CRM/ERP/SCM, etc... all you need is our system, then you can sunset your existing system(s) and lower TCO!"

They come up with new ways of representing the same old problem, parent-child relationships. They build new data models, and pile new applications on top of it with wonderful new functionality, but eventually (sigh) the business must change, and voila - I.T. and the business are back at it again, trying to figure out how to build change into these systems.

Ok, enough discussing it, what is the problem?
PARENT - CHILD RELATIONSHIPS, Dependencies across tables. In our example it would be how customers relate to portfolios, and how portfolios relate to accounts. Suppose your business moves on, and a year later decides: You know, a portfolio really can be attached to more than one customer at the same time (for instance, husband and wife are two separate customers in the list).

Now, you've done it! You've changed the business, you've also issued the change decree to our tiny little source system. Let's take a look at the new data model:

ERDiagramSimpleModelChanged.jpg

The diagram on the left is the original model, the diagram on the right is the new model.

Now you say, that change was easy! There's no reason why it should cause such heartburn in my business... Why the impact? Why the issues? Why the high cost?
Hold your horses Nellie....

I'll tell you. Here are the problems with the change to the data model:
1) Any "backups" you may have been making for the past year or two probably didn't include the original data model. If you've had a smart database administrator, they would have included a backup of the data model - so that restoration of the original data is possible. Otherwise, you'd be out of compliance in your source system.
2) The historical data doesn't meet this new rule; however it can be altered to fit the new structure going forward, why? Because this structure is more flexible and forgiving than the old one. Placing a "relationship table" or what we call many-to-many in the design increases the flexibility to represent _any_ relationship that comes forward in time. However, their have been cases where we force the data model to represent "business referential integrity" - in other words, this account cannot exist without being attached to an existing portfolio.

Even this fails over time, the relationship is sometimes shut-off / disabled, or goes bad (can never be enforced in the source system) for a variety of reasons.

3) the more parent-child relationships we have in our model (like the one still left, accounts and portfolios), the more inflexible our models are, the higher the cost of business change, and the larger the impact. Why? Why? Why? Well, suppose there are 25 child tables (25 pieces of information dependent on account), when account changes, the dependencies (all of them) are impacted, all the loading routines, validation routines, data access routines, and so on. This is why when you build a "model" of data in Excel, you typically build it very shallow: one, two or maybe three layers thick, because you have to change it quickly.

Ok, I'm kind of seeing this... I think I partially understand, so how do I fix it once and for-all?
It requires a new kind of thinking, an ability to express the data and the relationships independently of time / business changes over time, and it requires the ability to move the job of "referential integrity checks" back into the business logic, back in to the _application_ layers.

As long as parent-child referential integrity exists within _any_ data model, we (the business) will continue to suffer huge and growing change impacts, and high costs to business changes, and inflexibility. We need to shift the way we treat the referential integrity (which means, that in order for an account to exist, it must be tied to a portfolio that exists).

We need to put many to many relationships in 100% of our data model, we need to express the fact that portfolios (last year) could have one and only one customer, but this year, could be tied to two or more customers. We need to express (in the model) that accounts could be for one and only one portfolio (last year), and this year, could be tied to more than one portfolio.

The newest model should appear as follows:
ERDiagramFullModel.jpg

It isn't just for data warehousing, it's for Operational Systems too! Of course the arguments against it are as follows:
1) My data no longer represents the Ref. Integrity of the business, this is blasphemy!
2) I can't live like this... (why, I don't know)
3) This doubles the number of tables I have to manage!
4) This is ridiculous; I now have to change my application... (to do what?)

Here are the reasons why you'd want it this way:
1) isn’t Referential Integrity merely a statement of how business does business today? Isn't that subject to change when the business changes? Why then should it be locked down in a data model?
2) Can't help you if you don't believe it, I guess you'll have to continue to suffer the consequences of high cost, huge change impacts, and inflexibility.
3) Yes, and.... What impact does "doubling the tables" have if it brings you high levels of flexibility and better yet - the mathematics prove that over time, you change "less and less" of the existing model? New relationships, and new parts to the model become ADDITIVE.
4) Yes, and shouldn't your business applications be the ones with the business rules? Aren't they the ones with all the rest of the business rules? Why should a data representation change when the business changes? Why shouldn't _all_ the rules, including referential integrity be defined in the application (one step closer to the business)?

What if the application is a data warehouse application?
Great! You've just increased the flexibility to handle change in your data warehouse, absorb new systems, split off old ones, and so on.

Ok, I kind-of buy it, but what about Referential Integrity, when should that be enforced?
In two places:
1) When the data is captured within the application capturing it - it would clean up a LOT of these source systems, and put much more rigorous business logic (and cleaner data) into the source systems to begin with.
2) When the data is "released" for users, to reports, to screens, to output. This is when reusable common services / routines for getting data out are helpful. They implement the referential integrity in the application layer.

WHY? WHY? WHY? I still don't understand why you'd want Ref. Integrity in the application layer...
Well, look at it this way, if historical data "breaks" the current model, and you don't have the "old model" around to re-load it, then the historical data can no longer be accessed by auditors. (Bill Inmon talks about this in DW2.0, and handling metadata, of course Bill has been talking about this since 1990).

Another example is: if the application is "broken" and is loading data to the model that doesn't meet the business requirements of today, wouldn't you want to know where that broken piece is? You should, it's costing you (the business) money!! If you can't capture the broken data, then the broken business process may _never_ get fixed. We have to capture it to expose it, and expose it to know that it's there, and know that it's there to decide what to do about it.

What happens when I need to add a new piece of information to my business? Or merge a new system in? What does my model look like and how fast / easy is it to change?
Let's say, that in this example a new system had services, and the accounts are now to be rolled up into services, which are then owned by portfolios.

The original model (with the new system change) would look like this:
ERDiagramOriginalChange.jpg

The impact doesn't look too large, until you begin to consider how many other tables (owned by account) are impacted by the change in near the top of the model, not to mention the new & old data sets no longer jive - and to load "historical" data in to the operational system for an audit requires the old data model to be re-built as well. Integrating these sets of information is near impossible.

The new model (with using many-to-many relationships) would look something like this:
ERDiagramChangeRel.jpg

In this model, new data can be represented going forward, while old information still exists in the system of record (source system), this provides flexibility. In both cases, the impacts are highlighted with a green background.

I've developed a formalized data modeling architecture that is proving itself around the world as the mechanism by which to build these solutions, it's called: Data Vault Data Modeling (Common Foundational Integration Modeling) and it's being used by Tax Service in the Netherlands, US Government, SNS Bank, (WellPoint Health Care) Blue-Cross-Blue Shield, Diamler-Chrysler, and a number of other institutions around the world.

I hope this was helpful, as to explaining why there are large impacts, and high costs to making changes in business. I'd love to hear from you, thoughts, ideas, and comments..
Dan Linstedt
You can find out more about the Data Vault modeling at: http://www.DanLinstedt.com


Posted September 5, 2007 5:30 AM
Permalink | No Comments |

Interesting thoughts abound around the issues of determining context for a finite state model. If you've ever considered the metadata stored within a "data model" then you may be partial to the discussion here today. What I've got is an engine that starts with structural definitions, these definitions exists in a finite state, sometimes they openly declare hierarchies, other times, they hide the associations and relationships in the naming conventions. However one "true-ism" must reside: if a human cannot make "sense" of the structure, then it must be unusable. Therefore it is my belief that there usually (in most cases) is a finite state of context which can be automatically determined based on the structure through a series of mathematical, statistical, and ontological approaches.

These techniques are necessary in discovering the potential of unstructured data sets within DW2.0.

Ok, what's all this geek-speak?
What I'm saying is this: Data Models are constructed from a fundamental need to "house/organize/and store" information in mass-quantities. Data Models (much like file-folders) have levels of grain both within, and across multiple structures. The elements within a data model are typically stated according to some "finite" naming convention. Things like abbreviations, annotations, squashed / shortened definitions, relationships (referential integrity) and so on determine partial definition of the element.

In other words, they can _assist_ in placing the entire field / attribute into a hierarchical taxonomy. And further, can place the entity (table) into an over-arching ontology. So why do we spend all this time deciphering data models to "understand how our source systems store information" when we should be spending time deciding how to apply that information to a better more useful context?

Isn't there some mathematical way to represent the language of attributes and entities to which we can tie common meanings and definitions?
Turns out there is... free sources (like ontology engines such as WordNet) can help with language definition and raw metadata understanding from around the world. I believe that by tying this information together with a formulated understanding of a given data model that we can begin to understand "what we really represent" inside our capture systems.

What does this mean?
Well, what I'm referring to is the science of taking "what-is" described, making sense of the context through ontologies and taxonomies, then applying those definitions against "what the business thinks is happening" This produces a significant GAP analysis, it can also be bounced against what data is "stored" in the system, and whether or not the information matches up with the basic goals of the taxonomy.

In other words, there is inherent meaning to the design (even if the design is encrypted) that makes sense or provides context to the data _at the grain at which it stands_ within the data model. For example, in English we read left to right. Typically things on the "left side" carry more importance than things on the "right side". In the case of forumlas, things on the "left side" are where the final answers are put, while things on the "right side" define how we get there (computation).

Ok, so we have a field name:

CAP_EXP_TOTAL

What can we glean from this?
If we assume that the ontology for this definition is in the line of finance, we might end-up expanding the abbreviations to:
CAPITAL
EXPENDITURE
TOTAL

Each abbreviation has a subsequent meaning. The "total" component may be a definition of a function, and context of the data might only be derived when looking at the taxonomy (table name) that encapsulates this information. If the table is a SALES table, then that might be computed one way, if the table is a FINANCIAL table then that total may be computed differently. We can prove this fact by profiling the information housed within the fields.

I digress... What I'm discussing here, is the notion of gaining partial context from a semantic ontology layer, partially built from the model, and partially backed by "lookup" on definition, and the ontologies that each individual word might be housed in a larger ontology (from WordNet for example).

Now, if I had a field called:
TYP_CD

We might not be able to discern what "type" of "type code" this thing is without looking at the enclosing entity. On the other hand, TYP_CD might actually be abbreviated this way:

HULL_TYP_CD
TAIL_TYP_CD
CUST_TYP_CD
or even:
ACCT_TYPE_CODE

Each of these abbreviations share the general context of "type code". The fact that it is a "type code" is less important than what "type" of code the data is, and what grain the data lives in. The resulting abbreviation such as "TYP_CD" can then be generalized to a shared ontology, regardless of industry specific model, it can then also be represented by a finite set of definitions like: "TYPE_CODE, TYP_CODE, TP_CD, TYPE_CD" and so on. The shorter the abbreviations, the worse the confidence of an engine is to determine actual matches without looking at the data.

Ok, so what does all this mean?
This means we can automatically examine the data, the model, the ontologies, and glean or construct: partial meaning, grain, global or local (shared definitions/specific definitions), overlap, and finally: we can mathematically optimize the STRUCTURAL MODEL we have in order to achieve a better result, a more common result where the metadata of the business shifts to the "applied function of the data" at run-time.

We can achieve partial deterministic finite context for the base definition / storage of the data, and the actual congruence of the data set across multiple data models. It means we can provide 60% to 80% commonalities across data sets and data models around the world, it means potential standardization of grain and semantic layers.

If you have a sample set of a data model you'd like me to analyze in this part of the blog, I'd be happy to give you my thoughts on the break-down and semantic meanings, we'll see how well I do without knowledge of your business.

Cheers,
Daniel Linstedt


Posted September 2, 2007 11:34 AM
Permalink | No Comments |

Well, if you're like most of the world, you have an EDW that is built from a Federated Star Schema... This could be part of the problem, but definitely not all of it. A large part of the problem is where the business rules sit in terms of processing data going IN to the EDW. Regardless of architecture, if the Business rules sit between the source systems and the EDW (and actually don't sit in the source systems themselves), then we (I.T.) are doing a great dis-service to the business community, and ourselves. If you're like me, and have been around the block more than once with a Data Warehouse, then you know that "Changes to business usually don't come cheap" within the EDW environment - and that usually translates into high dollars, high impact, and eventually high-risk (resulting in Super-Nova of Star-Schemas as an EDW).

So what... I don't have this problem, my EDW is fast, nimble, and changes quickly...
Great, then this particular entry is not for you (it may be light entertaining reading for you), but for those of you with issues when the business changes, read-on....

So much of what we do in I.T. in building an EDW is to "help the business", but what we have failed to realize is that the industry for years has been feeding us an interesting way of getting that done. We've heard time and time again that we must implement "business rules" on the way IN to our data warehouses. That if we deliver "bad" data, that our warehouse efforts have failed, that we are to "fix data", fudge data, change data, or in other disguised terms: "quality check data, aggregate data, fill in blank values, compute data based on a lose set of business rules" or my personal favorite: "just get it to match the operational report..."

Wow, what if the Operational Report is wrong? They usually don't consider this to even be a possibility.... well who moved on, and blessed the operational report as "king?" Has the operational report been audited? Can the operational report give us the same answer that it provided for us 15 years ago? Has it changed? if it's changed, how, when, why? Usually because.... the business rules changed.

The architectural image used today looks a little like this (again, regardless of the data model)
TraditionalBusnEDWProcess.jpg

Ok - so the interpretation of the data changed, which means that the operational report (if run against historical) won't give us the same result that it gave us 15 years ago BASED ON THE ORIGINAL DATA. This is part of the point. When we (I.T.) take it upon ourselves to change data on the WAY IN to the EDW, we are subjecting the data in the EDW to the BUSINESS RULES OF TODAY... we're setting our EDW up for "failure" in no uncertain terms.

Our architectures, data models, load routines and queries all match "today's version of the truth." and the minute the business changes, or asks to see today’s data in comparison with historical - the warehouse and all it contains become invalid (to some degree), because the business now interprets data differently.

How on earth can we expect to audit such a system? It becomes increasingly difficult (even with audit trails of raw data sets) as the business continues to change, and we continue to implement those changes into our system GOING IN to our EDW.

How on earth do we mitigate this? This is a huge risk, huge concern right?
Well, there is one way, but it requires a fundamental shift in thinking as to WHAT an EDW is supposed to be, and how to load it. The fundamental shift is: the EDW should be an INTEGRATED (by business key only), raw view of historical, unchanging data at the grain in which it arrived. The data itself should not change even when the business changes.

The other thing this means is: move the BUSINESS PROCESSING of that data downstream- going from the EDW into the Data Marts. This is the right place for these business functions. This not only makes I.T. more nimble (moving the business logic closer to the business) but enables the EDW to be consistent over time, regardless of any business changes - no matter how big or how small. An architecture for this might look as follows:

NewBusnEDWProcess.jpg

I'm also suggesting that "exposing" the data flaws, and showing what the systems are producing "as-is" is just as important for fixing money loss at a business level, as showing cleansed and quality driven data. We (I.T.) need to put accountability for the data itself back in to the hands of the business, so that they can align what the source systems are doing with what they envision the business actually doing.

Exposing the flaws in the data, and the gaps between the belief of business processing versus what is actually happening can lead to real changes in the way business operates.

Again, is there value in "cleansing, cleaning, changing, standardizing data"?
Absolutely yes, but not on the way IN to the EDW, only on the way OUT to the data marts. This type of thinking allows us to become more nimble, and responsive. It lessens the impact on the EDW systems we've built when the business wants to change, and it provides the business with a manner in which they can be auditable, compliant, and accountable for finding and fixing the operational gaps in their source systems.

Love to hear your thoughts, and as always - if you disagree, I'd love to find out why, there's a good chance I've not thought of it in a different light.

Thank-you very much,
Daniel Linstedt


Posted September 1, 2007 4:10 PM
Permalink | No Comments |

Are you caught in the explosion of your life? Have you gotten to a point where federated star schemas aren't cutting it for Enterprise Data Warehousing? Is data volume pushing impacts on the star that can't be handled? What does a change to your single enterprise conformed dimension cost? How much time does it take, and how many impacts are there to integrate new systems or more data to your existing federated system?

Star schemas are GREAT ARCHITECTURES for solving subject oriented answers, they were and are not designed to be enterprise data warehouses! They get to a certain point where they get in the way of being nimble, creating workable solutions for the business, or constrain the business (because of cost or time to implement - due to impacts of changes). The business is now caught, in a super-nova exploding star. Let's take a look to see what happened, and what we can do to "fix" this situation.

Super-nova star schemas, ok: how did we get here?
We had no other way to build an architecture for an enterprise data warehouse, so we used the best we had.... True, that's how we got here. Now there's another, better, more nimble data modeling architecture on the block that solves these problems - and it's called: "COMMON FOUNDATIONAL ENTERPRISE MODELING ARCHITETURE" But for short, it's actually called: "Data Vault Modeling Architecture", and it's been available since 2000. It's been endorsed by Bill Inmon, Clive Finklestein, Claudia Imhoff, Lowell Fryman, Wayne Eckerson, and numerous others in the industry. Bill recently stated: "Data Vault is the optimal choice for DW2.0..."

Ok, but I don't have a super-nova on my hands, my federated data warehouse is working just fine... Why should I look at this?
If the architectural model is serving you today, cool - don't change. No problem. But at least bear with me to read about the problems that do occur with federated star-schemas that are currently going super-nova....

Signs of an impending super-nova:
* Business can't approve _business changes_ because the impacts cost too much or take too long for I.T. to implement into their Federated Star

* Business is constantly arguing over "what the conformed dimension means", causing either dimension-itis (too many dimensions), or duplicate copies of the data in the dimension, to be slightly tweaked to meet "individual needs."

* Business is constantly building their own spread-marts to answer their questions because I.T. is now "frozen" by an inability to change the federated mart (too many users, too many dependencies, too high a cost, too large an impact).

* Business is constantly building their own spread-marts because I.T. won't agree to provide data in a SUBJECT ORIENTED BUSINESS FOCUSED PURPOSE, because supposedly "it's already available in the enterprise federated star."

* 160 to 300 columns "appear" in a dimension, making it near impossible to manage or to provide full value to the business going forward.
* It becomes difficult if not impossible to integrate a new system into the federated system.

* "Copies" of the federated stars have propogated through-out the enterprise and then have been subsequently altered to meet specific business needs, now - the answers from those different federated stars no longer are consistent.

* The star has "JUNK" and "HELPER" tables embedded all over the place. What is a JUNK table anyway? And if it's JUNK, why is it in my model? Helper table? What's it helping? is it a band-aid for the architecture?

* Volumes have outgrown hardware capabilities, pushing the purchase of an appliance which only staves off the inevitable.

* Real-time processing conflicts with the batch processing

* Real-Time processing forces UPDATES to the data set, or causes orphaned facts, or worse yet, orphaned dimensions with keys that simply never make it to the rest of the business, and cannot be used.

* When was the last time that AUDITABLE and TRACEABLE data could be produced from your federated star-schema? Can you reproduce the source system data by reversing the algorithms (mathematically) to answer an auditors questions?

Well.... Hmmmmm... What on earth are you talking about?
Architecture, architecture, architecture.

Would you take a Porsche 911 and weld a trailer hitch to the back-end, to haul a camper up a 30 degree incline in 90 degree (farenheight) weather? COULD YOU DO THIS? YES, you can modify the ARCHITECTURE and adapt the car to actually do this, but what would happen? In this situation, you probably would blow out the motor right?

Any time you "adapt" an architecture, or apply band-aids to make it fit where it wasn't designed to fit, it inherits limitations and eventually "blows apart." (the wheels fall off, the band-aids don't stick anymore, it becomes un-managable).

Likewise would you take a Big-Rig, chop the top, leave the frame and motor in place, place a new fiberglass body on it and try to win an auto-race? No... Could you? Yes you could, but it doesn't make sense.

The Data Vault (Common Foundational Enterprise Architecture) provides this capability as a mathematically designed solution built to suit the needs of enterprise data warehousing. Star Schemas WERE and ARE designed to be a SUBJECT ORIENTED ANSWER DATABASE for use in OLAP solutions... They work wonders in pivots, and OLAP type processing, and yes - the architecture is valuable when applied appropriately, but for our sakes, don't apply the modeling paradigm to be your enterprise data warehouse.

Ok, so what is available? What can I do to stop the Super-Nova? I have one that I might end-up with, can I prototype something that will keep me from getting there?
Yes. The Data Vault modeling technique is flexible, highly scalable, and FREE - just like 3rd normal form, and star schema, the modeling techniques are freely available. It is the next step in data modeling for the enterprise data warehouse, you can build a very small data vault model as a Proof of concept, then proceed to build larger models on top of the existing foundation, and eventually "re-feed" your star-schemas as Type 1 dimensions.

With a Data Vault as the EDW, you no longer need Type 2 Dimensions to house history. Also the need for Conformed Dimensions is no longer an absolute necessity, you can do it where it makes sense. It releases the impacts of making changes because "everyone and everything" is using a particular conformed dimension.

Remember, like 3rd normal form or star-schema there is a full approach to using, building, implementing a Data Vault. More on that next time.

Cheers,
Dan Linstedt


Posted September 1, 2007 3:37 PM
Permalink | No Comments |

1 2 3 NEXT

Search this blog
Categories ›
Archives ›
Recent Entries ›