
The Flaws of the Classic Data Warehouse Architecture, Part 1


 

Originally published March 4, 2009


Overview

Despite the fact that the classic data warehouse architecture has served us well the last twenty years, Rick van der Lans explains why it may be time for a new architecture.

Rick van der Lans

SOURCE: The Flaws of the Classic Data Warehouse Architecture, Part 1

 
 

Comments

Posted March 27, 2009 by ANDREA VINCENZI andrea.vincenzi@tiscali.it

Rick, your article contains a lot of true observations, but it leaves me with a very strange feeling.

First, where did you get the figure that "8 out of 10 projects use this architecture"? I have read several market studies on DW architectures, and they do not support your statement (hint: check on Amazon what is by far the best-selling book on Data Warehouses...)

But the thing that really leaves me astonished is that you describe the classic "hub & spoke" architecture (although you don't use this word) and its flaws (which are real), but you don't mention the fact that there is already an architecture that addresses these flaws and that has been adopted by the majority of new DW projects. This is called the Bus Architecture (or Kimball Architecture), and I'm sure I'm not telling you anything new.

I'm curious to see whether you will mention that in the next part of your article (and of course I'm very curious to see what your suggested solution is, although I can guess).

I'm sorry if this sounds contentious; maybe I'm a little biased because I know the prevailing view of the b-eye site's experts on this matter.

Regards,

Andrea

Posted March 22, 2009 by Bruce Cassidy

Certainly some food for thought.  However, some of your points I don't agree with.

Let's take the example that springs to mind: "non-shareable specifications".  This is largely an issue with the toolsets, not with the "classic data warehouse" as such.  In fact, creating any sort of centralised data warehouse reduces that issue somewhat, as the varying tools can all point to the same databases, so at least they are working with the same objects.  I agree that most real-world environments are heterogeneous, but I would see this as an argument for a centralised data warehousing environment.

Secondly, there's more to redundancy than just transforming data.  It's also about removing the coupling between systems.  So if system A changes, suddenly we don't have a myriad of self-service reports that have all broken.  Also, if system A goes down, we don't also lose all of the reporting that relies on that data.

Thirdly, much of what happens within a data warehouse environment is about process management (quality management, master data management and so on).  You don't even address that.

So while you have some interesting points, they seem irrelevant without some sort of alternative.

Posted March 10, 2009 by Shawn Kung

Hi Rick - great points.  At Aster Data (an MPP data warehouse vendor), we share your beliefs.  A few thoughts:

(1) Operational BI flaw - copying across multiple databases not only leads to the high latency issues you pointed out, but also creates data integrity risks (after so much copying and transformation across disparate databases in the CDWA).

(2) Redundancy flaw - excellent point.  A horizontally scaling MPP warehouse appliance can significantly reduce overlapping data - primarily because of the sheer computational power to process granular data in parallel.  If you can do a complex ad-hoc query (e.g., multi-way joins) off original tables in an MPP system way faster than in a CDWA, the need for aggregations and complex tuning falls dramatically (as do redundant data costs).

(3) External, unstructured data - I do agree with the trend of growing analytics on both structured and unstructured data, but disagree that it must remain external.  Certainly a federation approach is useful in cases where the data has fleeting value, but an ELT approach provides the benefit of fast loading and the ability to leverage the parallelism of MPP to analyze patterns over longer histories.  Aster's In-Database MapReduce over structured/unstructured data in a distributed MPP environment has some amazing performance advantages, for example.

Thanks,

Shawn

Posted March 10, 2009 by Robert Eve

Rick -

The classic data warehouse architecture is a product of the business problems and technical solutions of its time. 

With the passage of time, many of those original business problems have been solved, meaning further investments in this architecture and these problems deliver lower marginal returns.

Further, time has enabled new solutions to come to the fore.  Many of these are based on new underlying technologies and techniques.  This enables new problems to be addressed as well as old problems to be addressed differently.

Enterprises and agencies now have many options, including blending traditional approaches with new, for example federating sources before feeding the warehouse, federating additional sources beyond the warehouse, and/or service-enabling data universally.

I look forward to part 2.

- Bob Eve

Posted March 4, 2009 by Phil Bailey

Hi Rick,

A nice build-up and I'll look forward to the 'answer', but I think I can guess where you're going with it...

I'd like to add to your point about using one of the new appliances, as you suggest in Flaw 3, to 'remove' the need to create multiple layers within our warehouse design. I think that if you keep your load routines as raw as possible, and then use eLT where possible, then you can drastically reduce the amount of time it takes for the data to transfer between those layers. The added benefit is that you can keep those layers, but speed them up so you can still provide traceability of data and lineage through all the various calculations that happen with the data. For the users to be able to see what happens to the data at every transform step is vital if you want to gain trust in your BI systems. If you are suggesting that these layers are removed then I don't see 'yet' how you would address this 'explanation' or 'auditability' of your data across the enterprise.
