Originally published March 4, 2009
Visit ten randomly picked customers that own a data warehouse, and you will see that at least eight have developed a classic data warehouse architecture (CDWA). What do I mean by a CDWA? In a CDWA, data is copied periodically from production systems to the central data warehouse. In some environments, before data is moved to the central data warehouse, it is first stored in an operational data store (ODS). If an ODS is used, it is usually for technical reasons.
Somewhere in this copy process, data is cleaned, filtered, and aggregated. Usually, data is copied from the central warehouse to the data marts. And, finally, depending on the business intelligence (BI) tools in use, data is copied from the data marts to cube technology. In short, this means that the classic data warehouse architecture consists of approximately four data storage layers (an ODS, a data warehouse, a few data marts, and several cubes) plus a chain of copy jobs to move data from one layer to another. Note that some organizations add more layers, such as a staging area.
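To make this chain of copy jobs concrete, here is a minimal sketch in Python; the layer names, job steps, and sample data are made up for illustration, and in practice each step would be a full ETL job that cleans, filters, or aggregates far more than shown here.

```python
# Illustrative sketch of the CDWA copy chain; all names and data are hypothetical.
# Each "copy job" moves rows from one storage layer to the next, optionally
# transforming them (cleaning, filtering, aggregating) along the way.

def clean(rows):
    # e.g. drop rows with missing keys
    return [r for r in rows if r.get("customer_id") is not None]

def aggregate(rows):
    # e.g. total sales per customer for the data mart
    totals = {}
    for r in rows:
        totals[r["customer_id"]] = totals.get(r["customer_id"], 0) + r["amount"]
    return [{"customer_id": k, "total_amount": v} for k, v in totals.items()]

# The four storage layers of the classic architecture.
layers = {"ods": [], "warehouse": [], "mart": [], "cube": []}

def nightly_batch(production_rows):
    # Copy job 1: production systems -> operational data store
    layers["ods"] = list(production_rows)
    # Copy job 2: ODS -> central data warehouse (cleaned)
    layers["warehouse"] = clean(layers["ods"])
    # Copy job 3: warehouse -> data mart (aggregated)
    layers["mart"] = aggregate(layers["warehouse"])
    # Copy job 4: data mart -> cube (here simply another copy)
    layers["cube"] = list(layers["mart"])

nightly_batch([
    {"customer_id": 1, "amount": 100.0},
    {"customer_id": 1, "amount": 50.0},
    {"customer_id": None, "amount": 75.0},
])
print(layers["cube"])  # the same facts now live, in some form, in every layer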
It is important to understand that another characteristic of the CDWA is that it is implicitly based on the assumption that all BI reports will use data stored in the data warehouse. In other words, to create a report, perform some analytics, or show a key performance indicator, the required data has to be stored in the data warehouse first. Only then does it become available for reporting and analytics. This makes the central data warehouse (CDW) the heart of the BI system. In a nutshell, this is the architecture most organizations have created, and it is what you could call the classic data warehouse architecture.
This CDWA has served us well for the last twenty years. In fact, up to five years ago we had good reasons to use this architecture. The state of database, ETL, and reporting technology did not really allow us to develop anything else. All the tools were aimed at supporting the CDWA. But the question right now is: twenty years later, is this still the right architecture? Is this the best possible architecture we can come up with, especially if we consider the new demands and requirements, and if we look at the new technologies available in the market? My answer would be no! To me, we are slowly reaching the end of an era, an era in which the CDWA was king. It is time for change. This article is the first in a series on the flaws of the CDWA and on an alternative architecture, one that fits the needs and wishes of most organizations for (hopefully) the next twenty years. Let's start by describing some of the CDWA flaws.
The first flaw is related to the concept of operational business intelligence. More and more, organizations show interest in supporting operational business intelligence, which means that the reports decision makers use have to include more up-to-date data. Refreshing the warehouse with source data once a day is not enough for those users. Decision makers who are close to the business processes especially need 100% up-to-date data. But how do you achieve this? You don't have to be a technological wizard to understand that, if data has to be copied four or five times from one data storage layer to another to get from the production databases to the reports, doing this in just a few seconds is close to impossible. We have to simplify the architecture to be able to support operational business intelligence. The bottom line is that we have to remove data storage layers and minimize the number of copy steps.
The second flaw is a very technical one and deals with the enormous amount of redundant data that we store in a classic data warehouse environment. Most data marts are loaded entirely with data coming from the CDW. The same applies to cubes, which means they are packed with redundant data. Even the ODS and the CDW contain masses of overlapping data. Inside each data storage layer we also store a lot of redundant data. Most of that redundant data is stored to improve the performance of queries, reports, and ETL scripts. This redundant data is hidden in indexes, materialized query tables, columns and tables with aggregated data, staging areas, and so on.
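The following small sketch shows where this hidden redundancy lives inside a single layer; the table names are made up and SQLite merely stands in for a real warehouse DBMS, but the point carries over: the same sales facts end up stored in the base table, again in an index, and again in a pre-aggregated summary table.

```python
import sqlite3

# Illustrative sketch (hypothetical table names, SQLite standing in for a
# warehouse DBMS): the same facts are physically stored three times.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (customer_id INTEGER, sale_date TEXT, amount REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "2009-03-01", 100.0), (1, "2009-03-02", 50.0), (2, "2009-03-02", 75.0)],
)

# Redundancy 1: an index repeats the values of the indexed column.
con.execute("CREATE INDEX idx_sales_customer ON sales (customer_id)")

# Redundancy 2: an aggregate table repeats the facts in summarized form
# (a hand-rolled stand-in for a materialized query table).
con.execute("""CREATE TABLE sales_per_customer AS
               SELECT customer_id, SUM(amount) AS total_amount
               FROM sales GROUP BY customer_id""")

print(con.execute("SELECT * FROM sales_per_customer").fetchall())
```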
We know that our data warehouses take up huge amounts of storage, terabytes of it. But how much original data is there really? An extensive study by UK-based analyst Nigel Pendse shows that a BI application needs approximately five gigabytes of original data; this is the median value. This number sounds realistic, but how does it match the results of studies that tell us that the average data warehouse is ten terabytes in size? If this were all original data, then according to Pendse's study you would need 2,000 BI applications with no overlapping data elements to get to ten terabytes. And that is highly unlikely. As indicated, the amount of redundant data is enormous.
Obviously, there is a reason for all this redundancy: performance. To speed up queries, we need our indexes, materialized query tables, and columns with aggregated data. You probably think: who cares about storage? Storage is not that expensive anymore. But that is not the issue. The issue is flexibility. The more redundant data there is, the less flexible the architecture is. Every change we make requires an extra change to the redundant data. We could simplify our warehouse architectures considerably by getting rid of most of the redundant data. Hopefully, the new database technology on the market, such as data warehouse appliances and column-oriented databases, will decrease the need to store so much redundant data.
Doing analytics and reporting on unstructured and external data brings us to the third flaw. Most warehouses are filled with structured data coming from production databases. But the need to do analytics on external data and on data stores with unstructured data is growing. Most vendors and analysts propose to handle these two forms of data as follows: if you want to analyze external or unstructured data, copy it into the CDW to make it available for analytics and reporting. In other words, the proposals are based on copying data. But why? Why not run the analytics directly on the unstructured data source itself, and directly on the external data? In a way, that would be the Internet-style solution. If I search for something on the Internet, I don't first copy all the data I need into one of my own databases; that data stays where it is. More and more document management systems allow us to analyze their databases directly. But can our current BI tools access them? With respect to external data, business intelligence over the Internet on external data can already be done in a very sophisticated way with some of the so-called mashup tools.
The fourth flaw of the CDWA can be described with the term non-sharable specifications. Many BI tools allow us to include specifications that make report definitions more independent of the data source. For example, in the Universe concept of Business Objects, we can define specific terms and relationships between tables. This is perfect for all the reports created with the Business Objects tools. But what if we want to create other reports in, for example, Excel or with the Cognos tools? It probably means we have to develop comparable specifications in those tools. All those specifications are non-sharable. My experience is that most environments are heterogeneous; they use different tools, so the need for sharable specifications does exist. This example of non-sharable specifications relates to BI tools, but I could also give you examples of other non-sharable specifications stored in ETL tools and database servers. Non-sharable specifications decrease flexibility and lead to inconsistent reports.
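A tool-neutral illustration of the problem (the business rule, the metric, and all of the definitions below are made up): the same derived measure ends up defined separately per tool, in each tool's own idiom, and nothing enforces that the copies stay in sync.

```python
# Hypothetical example: the rule "net revenue = gross revenue - discounts"
# re-implemented once per tool. If the rule changes, every copy has to be
# found and updated by hand, and any copy that is missed produces
# inconsistent reports.

# Definition 1: embedded in the SQL of a report built with one reporting tool.
report_sql = """
    SELECT customer_id, SUM(gross_revenue - discounts) AS net_revenue
    FROM sales GROUP BY customer_id
"""

# Definition 2: the same rule as a spreadsheet-style formula for another front end.
spreadsheet_formula = "=B2-C2"  # net revenue per row

# Definition 3: the same rule once more, inside an ETL script that loads a data mart.
def net_revenue(row):
    return row["gross_revenue"] - row["discounts"]
```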
The last flaw I want to discuss in this article deals with flexibility. The world of software engineering has taught us to separate the storage structure from the application structure. Why? Because if we separate them, changes to the storage structure do not always require changes to the application structure, which is good for maintenance and flexibility. One of the first authors to come up with this idea was David L. Parnas in his groundbreaking 1972 article "On the Criteria to Be Used in Decomposing Systems into Modules," where he introduced the notion of information hiding. This concept became the basis for more popular concepts such as object orientation and component-based development.
No software engineer has a problem with this concept, but we in the business intelligence world do. Our architectures are not at all based on information hiding. Almost all the reports we create are tied to the database server technology underneath. Take a simple example. If we use a reporting tool in which we can write our own SQL statements to access a specific database server, we will use all the bells and whistles of that product to get optimal performance. But what if we want to replace our database server with another that supports a slightly different SQL dialect? Or worse, imagine we want to switch to an MDX-based database server, or we want to access an external database that does not return a table but an XML document? In probably all of those situations, we have to change our report definitions dramatically. It is time we adopt the concept of information hiding in our data warehouse architectures to improve flexibility and to make it easier to adopt new, emerging technologies.
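Here is a minimal sketch of what information hiding could look like in this context; the class and method names are hypothetical, and the sketch is only meant to show the principle, not a finished design. Instead of embedding one server's SQL dialect in every report, the report depends on a small data-access interface; swapping the underlying technology then means writing one new adapter rather than rewriting every report definition.

```python
from abc import ABC, abstractmethod

# Without information hiding: the report embeds one server's SQL dialect,
# so changing servers means changing the report itself.
VENDOR_SPECIFIC_SQL = "SELECT TOP 10 customer_id, revenue FROM sales ORDER BY revenue DESC"

# With information hiding: reports depend only on this interface.
class SalesStore(ABC):
    @abstractmethod
    def top_customers(self, limit):
        ...

class SqlDialectStore(SalesStore):
    def top_customers(self, limit):
        # dialect-specific details stay hidden behind the interface
        return self._run(f"SELECT TOP {limit} customer_id, revenue FROM sales ORDER BY revenue DESC")

    def _run(self, sql):
        return []  # placeholder: execute against the actual database server

class XmlServiceStore(SalesStore):
    def top_customers(self, limit):
        # an XML-returning external source can be wrapped without touching any report
        return []  # placeholder: call the service and convert the XML document to rows

def revenue_report(store):
    # the report definition does not change when the storage technology does
    for customer_id, revenue in store.top_customers(10):
        print(customer_id, revenue)

revenue_report(SqlDialectStore())
revenue_report(XmlServiceStore())
```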
To summarize, the CDWA has supported us well, and for many organizations, it will still be the right solution for a few years to come. But it is time for a new architecture because the demands are changing, the requirements are becoming more complex, and new technology has been introduced. But where do we go from here? That will be the topic of part 2 of this article. Stay tuned.