

The Flaws of the Classic Data Warehouse Architecture, Part 1

Originally published March 4, 2009

Visit ten randomly picked organizations that own a data warehouse, and you will see that at least eight have developed a classic data warehouse architecture (CDWA). What do I mean by a CDWA? In a CDWA, data is copied periodically from production systems to the central data warehouse. In some environments, before data is moved to the central data warehouse, it is first stored in an operational data store (ODS), usually for technical reasons.

Somewhere in this copy process, data is cleaned, filtered, and aggregated. Usually, data is copied from the central warehouse to the data marts. And, finally, depending on the business intelligence (BI) tools in use, data is copied from the data marts to cube technology. In short, this means that the classic data warehouse architecture consists of approximately four data storage layers (an ODS, a data warehouse, a few data marts, and several cubes) plus a chain of copy jobs to move data from one layer to another. Note that some organizations add more layers, such as a staging area.
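
To picture this chain of copy jobs, here is a minimal, purely illustrative sketch written as plain SQL statements; all schema and table names (prod, ods, dwh, mart) are invented, and in practice these steps are implemented with ETL tools and schedulers rather than hand-written statements:

    -- Hypothetical copy chain; schema and table names are invented for illustration.
    INSERT INTO ods.customer_orders               -- 1. production system -> ODS
        SELECT * FROM prod.customer_orders;

    INSERT INTO dwh.customer_orders               -- 2. ODS -> central data warehouse,
        SELECT order_id, customer_id, amount      --    with cleaning and filtering
        FROM ods.customer_orders
        WHERE amount IS NOT NULL;

    INSERT INTO mart.sales_per_customer           -- 3. data warehouse -> data mart,
        SELECT customer_id, SUM(amount)           --    usually aggregated
        FROM dwh.customer_orders
        GROUP BY customer_id;

    -- 4. Finally, the data mart is loaded into cube technology by the BI tool.

Every step in this chain is a scheduled batch job, and every step adds its own delay and its own copy of the data.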

Another important characteristic of the CDWA is that it is based implicitly on the assumption that all BI reports will use data stored in the data warehouse. In other words, to be able to create a report, to perform some analytics, or to show a key performance indicator, the required data has to be stored in the data warehouse first. Only then does it become available for reporting and analytics. This makes the central data warehouse (CDW) the heart of the BI system. In a nutshell, this is the architecture most organizations have created, and it is what you could call the classic data warehouse architecture.

This CDWA has served us well for the last twenty years. In fact, up to five years ago we had good reasons to use this architecture. The state of database, ETL, and reporting technology did not really allow us to develop anything else; all the tools were aimed at supporting the CDWA. But the question right now is: twenty years later, is this still the right architecture? Is this the best possible architecture we can come up with, especially if we consider the new demands and requirements, and if we look at the new technologies available in the market? My answer would be no! To me, we are slowly reaching the end of an era, an era in which the CDWA was king. It is time for change. This article is the first in a series on the flaws of the CDWA and on an alternative architecture, one that fits the needs and wishes of most organizations for (hopefully) the next twenty years. Let’s start by describing some of the CDWA flaws.

The first flaw is related to the concept of operational business intelligence. More and more organizations show interest in supporting operational business intelligence. What this means is that the reports the decision makers use have to include more up-to-date data. Refreshing the warehouse with source data once a day is not enough for these users; decision makers who are close to the business processes especially need 100% up-to-date data. But how do you do this? You don’t have to be a technological wizard to understand that, if data has to be copied four or five times from one data storage layer to another to get from the production databases to the reports, doing this in just a few seconds becomes close to impossible. We have to simplify the architecture to be able to support operational business intelligence. Bottom line: we have to remove data storage layers and minimize the number of copy steps.
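
A quick back-of-the-envelope calculation shows why. Suppose, purely as an illustration, that every copy step in the chain runs on its own batch schedule; in the worst case the delays simply add up:

    -- Illustrative figures only: rough worst-case delay before a new transaction
    -- shows up in a report, if every layer is refreshed on its own schedule.
    SELECT 24     -- production systems -> ODS (nightly)
         + 24     -- ODS -> central data warehouse (nightly)
         + 24     -- data warehouse -> data mart (nightly)
         + 4      -- data mart -> cube (rebuilt a few times a day)
           AS worst_case_latency_in_hours;    -- 76 hours, nowhere near seconds

Even if all the jobs were perfectly synchronized, the latency could never drop below the running time of the whole chain.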

The second flaw is a very technical one and deals with the enormous amount of redundant data that we store in our CDW. Most data marts are loaded entirely with data coming from the CDW. The same applies to cubes, which means they are packed with redundant data. Even the ODS and the CDW contain masses of overlapping data. Inside each data storage layer, we also store a lot of redundant data. Most of that redundant data is stored to improve the performance of queries, reports, and ETL scripts. This redundant data is hidden in indexes, materialized query tables, columns and tables with aggregated data, staging areas, and so on.

We know that our data warehouses take up huge amounts of storage, terabytes of it. But how much original data is there really? An extensive study by UK-based analyst Nigel Pendse shows that a BI application needs approximately five gigabytes of original data; this is the median value. That number sounds realistic, but how does it match the results of studies telling us that the average data warehouse is ten terabytes in size? If this were all original data, then according to Pendse’s study you would need roughly 2,000 BI applications with no overlapping data elements to get to ten terabytes. And that is highly unlikely. As indicated, the amount of redundant data is enormous.
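
The arithmetic behind that figure is straightforward (using 1,024 gigabytes per terabyte):

    -- How many non-overlapping BI applications of about 5 GB of original data
    -- would it take to fill an average 10 TB warehouse?
    SELECT (10 * 1024) / 5 AS bi_applications_needed;   -- 2,048, roughly 2,000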

Obviously, there is a reason for all this: performance. To speed up queries, we need our indexes, materialized query tables, and columns with aggregated data. You probably think: who cares about storage? Storage is not that expensive anymore. But that is not the issue. The issue is flexibility. The more redundant data we store, the less flexible the architecture is. Every change we make requires an extra change to the redundant data. We could simplify our warehouse architectures considerably by getting rid of most of that redundant data. Hopefully, the new database technology on the market, such as data warehouse appliances and column-based database technology, will decrease the need to store so much redundant data.
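
To illustrate the kind of redundancy involved, here are two typical derived structures; the table and column names are invented, and the exact syntax differs per database server:

    -- An index duplicates the values of the indexed column.
    CREATE INDEX ix_sales_date ON sales_fact (sale_date);

    -- A materialized query table (materialized view) stores precomputed
    -- aggregates that could also be derived from the base table at query time.
    CREATE MATERIALIZED VIEW sales_per_month AS
        SELECT product_id,
               EXTRACT(YEAR FROM sale_date)  AS sales_year,
               EXTRACT(MONTH FROM sale_date) AS sales_month,
               SUM(amount)                   AS total_amount
        FROM sales_fact
        GROUP BY product_id,
                 EXTRACT(YEAR FROM sale_date),
                 EXTRACT(MONTH FROM sale_date);

Both structures exist purely for performance, both have to be maintained whenever the base tables change, and that maintenance is exactly where flexibility is lost.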

Doing analytics and reporting on unstructured and external data brings us to the third flaw. Most warehouses are filled with structured data coming from production databases. But the need to do analytics on external data and on data stores with unstructured data is growing. Most vendors and analysts propose to handle these two forms of data as follows: if you want to analyze external or unstructured data, copy it into the CDW to make it available for analytics and reporting. In other words, the proposals are based on copying data. But why? Why not do the analytics directly on the unstructured data source itself, and directly on the external data? In a way, that would be the Internet-style solution. If I search for something on the Internet, I don’t first copy all the data I need into one of my own databases; that data stays where it is. More and more document management systems allow us to analyze their databases directly. But can our current BI tools access them? With respect to external data, doing business intelligence over the Internet on external data can already be done in a very sophisticated way with some of the so-called mashup tools.
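
To give an idea of what analyzing data in place could look like, here is a purely hypothetical sketch: suppose a federation or mashup layer exposes the documents in a document management system as a virtual table called customer_complaints. The documents stay where they are, yet they can be combined with warehouse data in a single query (all names below are invented):

    -- Hypothetical: customer_complaints is a virtual table exposed by a
    -- federation layer on top of a document management system. No copy of the
    -- documents is loaded into the warehouse.
    SELECT d.region,
           COUNT(*) AS complaints_this_week
    FROM   customer_complaints c
           JOIN dwh.customer_dim d ON d.customer_id = c.customer_id
    WHERE  c.received_date >= CURRENT_DATE - 7
    GROUP  BY d.region;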

The fourth flaw of the CDWA can be described with the term non-sharable specifications. Many BI tools allow us to include specifications that make report definitions more independent of the data sources. For example, in the Universe concept of Business Objects, we can define specific terms and relationships between tables. This is perfect for all the reports created with the BO tools. But what if we want to create other reports in, for example, Excel or with the Cognos tools? It probably means we have to develop comparable specifications in those tools. All those specifications are non-sharable. My experience is that most environments are heterogeneous; they use different tools, so the need for sharable specifications does exist. This example of non-sharable specifications relates to BI tools, but I could also give you examples of non-sharable specifications stored in ETL tools and database servers. Non-sharable specifications decrease flexibility and lead to inconsistent reports.
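
To make the word specifications more tangible, think of a derived business measure such as net revenue. Expressed as a SQL-like definition (with invented table and column names), it could look as follows; as long as such a definition lives only inside one tool’s semantic layer, every other tool has to repeat it in its own format, and the copies can slowly drift apart:

    -- Hypothetical definition of the business term "net revenue".
    SELECT customer_id,
           SUM(quantity * unit_price - discount) AS net_revenue
    FROM   order_lines
    GROUP  BY customer_id;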

The last flaw I want to discuss in this article deals with flexibility. The world of software engineering has taught us that we have to separate the storage structure from the application structure. Why? Because if we separate them, changes to the storage structure do not always require changes to the application structure, which is good for maintenance and flexibility. One of the first authors to come up with this idea was David L. Parnas in his groundbreaking 1972 article “On the Criteria To Be Used in Decomposing Systems into Modules,” in which he introduced the notion of information hiding. This concept became the basis for more popular concepts such as object orientation and component-based development.

No software engineer has problems with this concept, but we in the business intelligence business do. Our architectures are not at all based on information hiding. Almost all the reports we create are tied to the database server technology underneath. Take a simple example. If we use a reporting tool in which we can write our own SQL statements to access a specific database server, we will use all the bells and whistles of that product to get optimal performance. But what if we want to replace our database server with another one that supports a slightly different SQL dialect? Or worse, imagine we want to switch to an MDX-based database server, or we want to access an external database that does not return a table but an XML document? In probably all of those situations, we have to change our report definitions dramatically. It is time that we adopt the concept of information hiding in our data warehouse architectures to improve flexibility and to make it easier to adopt new, emerging technologies.
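
To make this concrete, consider something as simple as asking for the ten largest orders (the table name below is invented). Every database server expects a slightly different statement, and a report that contains one of these statements is therefore tied to that specific server:

    -- The same question, "the ten largest orders," in three common SQL dialects:
    SELECT TOP 10 order_id, amount FROM orders ORDER BY amount DESC;      -- SQL Server
    SELECT order_id, amount FROM orders ORDER BY amount DESC LIMIT 10;    -- MySQL, PostgreSQL
    SELECT order_id, amount FROM orders ORDER BY amount DESC
        FETCH FIRST 10 ROWS ONLY;                                         -- DB2, standard SQL

If the report only had to state what it needs, and a separate layer translated that request into the right dialect, an MDX statement, or a call that returns an XML document, replacing the underlying technology would no longer break the report.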

To summarize, the CDWA has supported us well, and for many organizations, it will still be the right solution for a few years to come. But it is time for a new architecture because the demands are changing, the requirements are becoming more complex, and new technology has been introduced. But where do we go from here? That will be the topic of part 2 of this article. Stay tuned.

  • Rick van der Lans

    Rick is an independent consultant, speaker and author, specializing in data warehousing, business intelligence, database technology and data virtualization. He is managing director and founder of R20/Consultancy. An internationally acclaimed speaker who has lectured worldwide for the last 25 years, he is the chairman of the successful European Enterprise Data and Business Intelligence Conference, held annually in London. In the summer of 2012 he published his new book Data Virtualization for Business Intelligence Systems. He is also the author of one of the most successful books on SQL, the popular Introduction to SQL, which is available in English, Chinese, Dutch, Italian and German. He has written many white papers for various software vendors. Rick can be contacted by sending an email to rick@r20.nl.

    Editor's Note: Rick's blog and more articles can be accessed through his BeyeNETWORK Expert Channel.


Comments


Posted March 27, 2009 by Andrea Vincenzi (andrea.vincenzi@tiscali.it)

Rick, your article contains a lot of true observations, but it leaves me with a very strange feeling.

First, where did you get the data about "8 out of 10 projects use this architecture"? I have read several market studies on DW architectures, and they do not support your statement (hint: check on Amazon what is by far the best-selling book on data warehouses...)

But the thing that really leaves me astonished is that you describe the classic "hub & spoke" architecture (although you don't use this word) and its flaws (which are real), but you don't mention the fact that there is already an architecture that addresses these flaws and that has been adopted by the majority of new DW projects. This is called the Bus Architecture (or Kimball Architecture), and I'm sure I'm not telling you anything new.

I'm curious to see if you will mention that in the next part of your article (and of course I'm very curious to see what your suggested solution is, although I can guess).

I'm sorry if this sounds contentious; maybe I'm a little biased because I know the prevailing view of the b-eye site experts on this matter.

Regards,

Andrea


Posted March 22, 2009 by Bruce Cassidy

Certainly some food for thought.  However, some of your points I don't agree with.

Let's take the example that springs to mind: "non-shareable specifications".  This is largely an issue with the toolsets, not with the "classic data warehouse" as such.  In fact, creating any sort of centralised data warehouse reduces that issue somewhat, as the varying tools can all point to the same databases, so at least they are working with the same objects.  I agree that most real world environments are heterogeneous, but I would see this as an argument for a centralised data warehousing environment.

Secondly, there's more to redundancy than just transforming data.  It's also about removing the coupling between systems.  So if system A changes, suddenly we don't have a myriad of self-service reports that have all broken.  Also, if system A goes down, we don't also lose all of the reporting that relies on that data.

Thirdly, much of what happens within a data warehouse environment is about process management (quality management, master data management and so on).  You don't even address that.

So while you have some interesting points, they seem irrelevant without some sort of alternative.


Posted March 10, 2009 by Shawn Kung

Hi Rick - great points.  At Aster Data (an MPP data warehouse vendor), we share your beliefs.  A few thoughts:

(1) Operational BI flaw - copying across multiple databases not only leads to the high latency issues you pointed out, but also creates data integrity risks (after so much copying and transformation across disparate databases in the CDWA).

(2) Redundancy flaw - excellent point.  A horizontally scaling MPP warehouse appliance can significantly reduce overlapping data - primarily because of the sheer computational power to parallel-process granular data.  If you can do a complex ad-hoc query (e.g., multi-way joins) off original tables in an MPP system way faster than in a CDWA, the need for aggregations and complex tuning falls dramatically (as well as redundant data costs).

(3) External, unstructured data - I do agree with the trend of growing analytics on both structured and unstructured data, but disagree that it must remain external.  Certainly a federation approach is useful in cases where the data has fleeting value, but an ELT approach provides the benefit of fast loading and the ability to leverage the parallelism of MPP to analyze patterns over longer histories.  Aster's In-Database MapReduce over structured/unstructured data in a distributed MPP environment has some amazing performance advantages, for example.

Thanks,

Shawn


Posted March 10, 2009 by Robert Eve

Rick -

The classic data warehouse architecture is a product of the business problems and technical solutions of its time. 

With the passage of time, many of those original business problems have been solved, meaning further investments in this architecture and these problems deliver lower marginal returns.

Further, time has enabled new solutions to come to the fore.  Many of these are based on underlying technologies and techniques that are also new.  This enables new problems to be addressed, as well as old problems to be addressed differently.

Enterprises and agencies now have many options, including blending traditional approaches with new ones: for example, federating sources before feeding the warehouse, federating additional sources beyond the warehouse, and/or service-enabling data universally.

I look forward to part 2.

- Bob Eve


Posted March 4, 2009 by Phil Bailey

Hi Rick,

A nice build-up and I'll look forward to the 'answer', but I think I can guess where you're going with it...

I'd like to add to your point about using one of the new appliances, as you suggest in Flaw 3, to 'remove' the need to create multiple layers within our warehouse design. I think that if you keep your load routines as raw as possible, and then use ELT where possible, you can drastically reduce the amount of time it takes for the data to transfer between those layers. The added benefit is that you can keep those layers, but speed them up, so you can still provide traceability and lineage of the data through all the various calculations that happen to it. For the users to be able to see what happens to the data at every transform step is vital if you want to gain trust in your BI systems. If you are suggesting that these layers are removed, then I don't see 'yet' how you would address this 'explanation' or 'auditability' of your data across the enterprise.
