

The Flaws of the Classic Data Warehouse Architecture, Part 2: The Introduction of the Data Delivery Platform

Originally published April 1, 2009

In Part 1 of this series, we described the classic data warehouse architecture (CDWA). This architecture is based on a set of data stores linked by a chain of copy scripts; see Figure 1. Examples of data stores are the central data warehouse, the operational data store, the data marts, and the multidimensional cubes. This architecture has served us well for the last twenty years. Based on the available technology plus the demands and requirements of the users, it was the right architecture. But, on the one hand, technology has evolved: new database technology is available, in-memory analytics have been introduced, and from the world of the Internet we received the mashup. And, on the other hand, the demands and requirements of the users are changing.


Figure 1: The Classic Data Warehouse Architecture 

In that same article, we summarized the flaws of the CDWA with respect to those new demands and requirements. In this article, we introduce a new architecture, one that we call the data delivery platform (DDP). 

Figure 2 shows a high-level overview of the DDP. The main difference between the DDP and the CDWA is that the central data warehouse forms the heart of the CDWA, whereas in the DDP, it is the software layer residing between the data stores and the reports that can be seen as the heart of the system. 


Figure 2: The High-Level Architecture of the Data Delivery Platform 

In the rest of this article, we will give a more detailed description of the DDP, and we will compare the CDWA with the DDP based on the flaws mentioned in Part 1. Note that in this article, we will refer to data stores as data providers and to reports, KPIs, scorecards, and so on as data consumers.

The essence of the DDP is to decouple the data consumers from the data providers. The data consumers request the information they need, and the DDP supplies that information by retrieving it from the data providers. The data consumers have no idea whether the data they request comes from the central data warehouse, a data mart, a cube, a production database, an external source, or maybe a combination of all of those. In fact, this is not important to them; the data providers are hidden from the consumers. They see one large database. For a data consumer, it is more important to know that the data supplied has the right quality level, is exactly what they requested, is sufficiently up to date, and is returned with the right performance than it is to know that the data comes from a specific data store.
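
To make the decoupling concrete, here is a minimal sketch in Python (the platform class, the logical dataset name, and the provider functions are all hypothetical, not part of any product the article describes): a consumer asks the platform for a logical dataset and never learns which physical store supplied the rows.

```python
# Hypothetical sketch of the decoupling idea; not a description of a real DDP product or API.
from typing import Callable, Dict, List


class DataDeliveryPlatform:
    """Routes requests for logical datasets to whichever provider currently holds them."""

    def __init__(self) -> None:
        self._routes: Dict[str, Callable[[], List[dict]]] = {}

    def register(self, logical_name: str, fetch: Callable[[], List[dict]]) -> None:
        self._routes[logical_name] = fetch

    def query(self, logical_name: str) -> List[dict]:
        # The consumer names the data it wants, not the store that holds it.
        return self._routes[logical_name]()


# One hypothetical provider among many: a data mart, a cube, the warehouse, ...
def sales_from_mart() -> List[dict]:
    return [{"region": "Northern Europe", "revenue": 120_000}]


ddp = DataDeliveryPlatform()
ddp.register("monthly_sales", sales_from_mart)

# A report simply asks for "monthly_sales"; the provider stays hidden.
print(ddp.query("monthly_sales"))
```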

It is important to understand that we are not proposing to phase out the central data warehouse itself. There are, and will always be, good reasons for copying data entered in production systems to a central data warehouse. These are the classic reasons for introducing a data warehouse in the first place. For example, most production databases do not contain the historical data that we need for trend analysis. Another reason is that if we run complex queries on the production databases, the data entry process slows down severely. A third reason is that we (quite often) have to clean and filter the production data before it becomes usable for consumption. And there are more reasons why we would still want a data warehouse. So again, the DDP most likely needs a data warehouse.

If we look at the flaws described in Part 1, how does the data delivery platform prevent them? 

The first flaw had to do with the number of data layers in the CDWA. One disadvantage of too many layers is that they make the development of operational BI applications extremely complex. All the copying between data stores slows the process. If the DDP is in place and all data consumers extract data through it, we can start to simplify the underlying storage structure. For example, we can drop a data mart and redirect all the data consumers accessing that data mart to the central data warehouse. Or we could remove a cube and redirect its queries to a data mart or the data warehouse. In both cases, we are removing data layers; in other words, we are simplifying the architecture. Fewer data layers mean less copying, and that means we will be able to get the data more quickly from the source of entry to the data consumers. In fact, we could even consider (if the systems are powerful enough) letting some data consumers access the production databases through the DDP to get access to 100% up-to-date data.
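
One way to picture the simplification (with purely illustrative names and routes): dropping a data mart or a cube becomes a change to a single mapping inside the DDP rather than a change to every report that used it.

```python
# Hypothetical routing configuration of a DDP, shown as plain data.
# Consumers always use the logical name on the left; the physical target
# on the right can change as data layers are removed.

routes_before = {
    "monthly_sales": "finance_data_mart.sales_summary",   # extra layer
    "customer_churn": "marketing_cube.churn",             # extra layer
}

routes_after = {
    # The data mart and the cube have been dropped; requests are redirected
    # to the central warehouse (or, if fast enough, a production database).
    "monthly_sales": "central_warehouse.sales_summary",
    "customer_churn": "central_warehouse.churn_view",
}


def resolve(logical_name: str, routes: dict) -> str:
    """Return the physical location that currently serves a logical dataset."""
    return routes[logical_name]


# The consumer's request is identical before and after the simplification.
print(resolve("monthly_sales", routes_before))
print(resolve("monthly_sales", routes_after))
```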

The second flaw relates to the enormous amount of duplicate data that we store. The DDP can minimize duplicate data storage in two ways. First, by decoupling consumers from providers, it becomes easier to replace one database server product with another. If the two are tightly coupled, the queries probably contain product-specific features that make it hard to port the reports to another database server. If the DDP does what it should do, it should be able to handle different dialects. This should allow us to replace a more classic database server product, one that requires a lot of duplicate data storage to perform adequately, with, for example, a data warehouse appliance that needs a minimal amount of duplicate data storage. Second, because each data consumer accesses the DDP and not one specific data store, there is less need to create duplicate data, although performance issues could still demand duplication. But, hopefully, the DDP minimizes the need for storing duplicate data in whatever form.
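
A rough sketch of the dialect point, assuming two invented dialect labels and a single rewrite rule: if consumers hand the DDP a neutral request, the platform can render it in whichever SQL dialect the current back end speaks, which is what makes swapping one database product for another practical.

```python
# Hypothetical illustration of dialect translation inside a DDP.
# A neutral "give me the top N rows" request is rendered per back end.

def top_n_query(table: str, n: int, dialect: str) -> str:
    if dialect == "classic_dbms":      # a product whose dialect uses TOP
        return f"SELECT TOP {n} * FROM {table}"
    if dialect == "appliance":         # a product whose dialect uses LIMIT
        return f"SELECT * FROM {table} LIMIT {n}"
    raise ValueError(f"unknown dialect: {dialect}")


# The consumer's request never changes; only the platform's rendering does,
# so the back end could be replaced with, say, a data warehouse appliance.
print(top_n_query("sales_summary", 10, "classic_dbms"))
print(top_n_query("sales_summary", 10, "appliance"))
```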

The third flaw relates to analytics and reporting on external and unstructured data sources. The DDP should be intelligent enough to access external sources through, for example, SOAP-based services and mashup technology. The DDP should also be able to access document management systems, email systems, and other systems that contain unstructured data. The fact that those systems are not accessible through SQL, MDX, or other common database languages should not be an issue for the data consumer. The DDP should be able to convert the language the data consumer uses into the language the unstructured data source supports. If this is possible, there is no need to copy the huge amounts of data from unstructured sources to the data warehouse, or to extract data from those sources in advance.
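
The sketch below suggests how an unstructured source might be surfaced without copying it into the warehouse; the document store is simulated in memory, and the field names are invented. The point is only that the DDP translates a tabular request into whatever the source actually supports and reshapes the answer into rows.

```python
# Hypothetical wrapper that exposes a document store as rows.
# In a real DDP this might sit behind a SOAP service or a mashup;
# here the unstructured source is simulated with an in-memory list.

documents = [
    {"id": "doc-1", "subject": "Q1 outlook", "body": "Revenue in Northern Europe ..."},
    {"id": "doc-2", "subject": "Churn analysis", "body": "Customer churn increased ..."},
]


def search_documents(keyword: str) -> list[dict]:
    """Translate a row-oriented request into a keyword search on the source."""
    return [
        {"document_id": d["id"], "subject": d["subject"]}
        for d in documents
        if keyword.lower() in d["body"].lower()
    ]


# The consumer asks a tabular question and gets rows back, even though the
# underlying source speaks neither SQL nor MDX.
print(search_documents("churn"))
```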

Non-sharable specifications form the fourth flaw of the CDWA. Currently, each data consumer tool has its own set of data-related specifications, and there is no way to share those specifications. The DDP should be "smart" enough to hold specifications related to data structures, and all the data consumers should be able to use those specifications. Whether a report is developed in Excel, Business Objects, Cognos, or Spotfire, it should see and use the same specifications. For example, whether or not the Northern European Region includes the United Kingdom should be known to the DDP, and all the data consumers should be able to exploit that specification. Or, if an optional one-to-many relationship exists between two tables (even if those two tables are stored in separate data stores), that fact should be known to the DDP. It is also the DDP that should be aware that different users might have different definitions for the same concept. Specifications dealing with security, such as who is allowed to see what, should also be maintained by the DDP. Of course, each data store still needs to register security specifications, but it only stores the security specifications related to its own data elements. Maybe we also want to indicate that some users are not allowed to integrate data elements coming from two different data stores.
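
A sketch of shared specifications (the region membership and the security rule below are invented examples, not actual definitions): because the definitions live in the platform, a report built in Excel, Business Objects, Cognos, or Spotfire would resolve "Northern European Region" in exactly the same way.

```python
# Hypothetical shared specifications held centrally by the DDP.

# One definition of a business concept, usable by every consumer tool.
REGION_MEMBERS = {
    "Northern Europe": ["Sweden", "Norway", "Denmark", "Finland", "United Kingdom"],
}

# One place to record who may combine data elements from different stores.
CROSS_STORE_ACCESS = {
    # (user role, provider A, provider B) -> allowed to integrate?
    ("analyst", "hr_database", "salary_warehouse"): False,
    ("analyst", "sales_mart", "central_warehouse"): True,
}


def countries_in(region: str) -> list[str]:
    """Every report resolves the region the same way, whatever tool built it."""
    return REGION_MEMBERS[region]


def may_integrate(role: str, provider_a: str, provider_b: str) -> bool:
    """Security specification that spans data stores, maintained by the DDP."""
    return CROSS_STORE_ACCESS.get((role, provider_a, provider_b), False)


print(countries_in("Northern Europe"))
print(may_integrate("analyst", "hr_database", "salary_warehouse"))
```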

Again, the fact that the DDP holds all those specifications does not mean that the data consumer tools are not allowed to store any specifications. It might well be that they need their own specifications just to be able to function properly. What it means is that the DDP holds the source of all those specifications. We could state that the DDP gives access to and is the guardian of all the data and all the specifications related to data access. 

The last flaw discussed in Part 1 dealt with the concept of information hiding. The whole DDP architecture is based on information hiding. The data consumers have been decoupled from the data providers by the DDP; in fact, the DDP is the information hider. Many changes to the data stores have no impact on the consumers. Of course, such changes influence specifications stored within the DDP, but they will not be reflected in the reports. The ideal situation is that if a specific change to the data providers is not relevant to a data consumer, that data consumer is not affected at all.
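
Information hiding can be sketched the same way (the column names and the mapping are hypothetical): when a provider renames a column, only the platform's mapping changes, while the consumer's request and result stay identical.

```python
# Hypothetical illustration of information hiding: a provider-side rename
# is absorbed by the DDP's mapping, so consumers are unaffected.

# The provider recently renamed its column "cust_nm" to "customer_name".
provider_rows = [{"cust_no": 42, "customer_name": "Acme Ltd"}]

# The DDP maps physical provider columns to the logical names consumers use.
COLUMN_MAPPING = {
    "customer_id": "cust_no",
    "customer": "customer_name",   # previously mapped to "cust_nm"
}


def project(rows: list[dict]) -> list[dict]:
    """Return rows under the stable, logical column names."""
    return [
        {logical: row[physical] for logical, physical in COLUMN_MAPPING.items()}
        for row in rows
    ]


# Consumers keep asking for "customer_id" and "customer"; only the mapping
# inside the DDP changed when the provider renamed its column.
print(project(provider_rows))
```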

Aside from the fact that the data delivery platform does not have the flaws the CDWA has, or at least not with the same intensity, the DDP offers some extra benefits. For example, decoupling data consumers from data providers makes it easier to outsource the data stores and their technical management. Those advantages will be discussed later in this series. Also, in the coming parts of the series, we will discuss topics such as the importance of contracts; the differences between the DDP and enterprise information integration, the virtual data warehouse, and ETL; the technologies we can use to develop a DDP; and how to introduce a DDP gradually.

Rick van der Lans

Rick is an independent consultant, speaker and author, specializing in data warehousing, business intelligence, database technology and data virtualization. He is managing director and founder of R20/Consultancy. An internationally acclaimed speaker who has lectured worldwide for the last 25 years, he is the chairman of the successful European Enterprise Data and Business Intelligence Conference, held annually in London. In the summer of 2012 he published his new book, Data Virtualization for Business Intelligence Systems. He is also the author of one of the most successful books on SQL, the popular Introduction to SQL, which is available in English, Chinese, Dutch, Italian and German. He has written many white papers for various software vendors. Rick can be contacted by sending an email to rick@r20.nl.



Comments


Posted June 11, 2009 by Bas Stiekema bas.stiekema@xbas.nl

Nice concept. Actually, I have already evolved the CDWA into a new DWA with almost the same outcome as your DDP architecture.

We took the CDWA as a basis and even added a few layers to it, but with the difference that almost all of the layers are optional. For every project or requirement, we ask ourselves whether we need that particular layer or not, then we look at the purpose of the next layer, and so on, until the last layer, which is the Business DataMart (BDM) layer. The BDM layer is actually the same as what you describe as the DD layer.

We haven't done everything yet, but we have the concept and are building the first projects in this new architecture.

I will follow, and where I can, contribute to the discussions on this topic. We also have some challenges to solve before we can make this architecture work.

Regards, Bas Stiekema


Posted May 13, 2009 by George Allen

DITTO!!! In my enterprise, the number of information sources is staggering, all over the map as far as data governance goes and owned by as many units as there are sources. More time is spent by a user gathering data from these silos than analyzing the results.

I have been designing a prototype knowledge management platform that standardizes, stabilizes and simplifies all of this architecture into a single portal. And what I have been designing looks remarkably similar to your DDP. Abstracting the delivery platform from the sources of information allows me to implement the platform in stages, as the sources evolve into new formats and platforms. I only need to reprogram the abstraction layer and leave my delivery platform alone.

Thanks for the article.  I don't feel so "out in left field" as my colleagues think I am.

 

George


Posted May 4, 2009 by ANDREA VINCENZI

The so-called "data consumers" usually get the data they need by issuing SQL or MDX queries. In your opinion, is the DDP going to be able to process SQL and/or MDX?

Most reporting tools have a layer whose job is to deal with different sources of data and present them in a transparent way to the user (the most famous are probably BO Universes, but Oracle BI, Cognos and Microsoft have similar objects). However, these layers are integrated with each reporting tool, while the thing that you are proposing is much more complex and undefined, and frankly I think it has limited value, unless you go into much more detail about how to implement it.

By the way, it would be nice to receive a reply from you about my comments to the 1st part of your article (and those of other readers too), since the purpose of comments should be to exchange ideas and opinions.

Regards,

Andrea


Posted April 11, 2009 by Rob Konnor

Great article, and it is the way of the future in DW/BI as soon as companies can let go of their classic designs and legacy infrastructure investments. All too often I hear that we have spent so much on X and X and we do not dare change our ways.


Posted April 9, 2009 by Ronald Damhof ronald.damhof@prudenza.nl

Rick, I have read these articles, and it's nice to see someone write something challenging again. You rely heavily on some kind of information hiding concept, which is quite nice, and you apply it to data consumers. What makes me feel a bit uneasy is the DDP thingie. What kind of technology are you proposing that is capable of executing the functionality of the DDP? EII kind of stuff? Or are you urging vendors to develop this?

Working with huge amounts of data, huge data quality problems and history requirements, I have trouble envisioning the ability to execute if I were to implement such an architecture.

To put it more bluntly: is the architecture you are proposing feasible today or 'tomorrow'? And if today, how?

Part 3?

Keep up the good work!


Posted April 1, 2009 by Mario Rubbo mariorubbo@yahoo.com

I look forward to the next article.
