Data Warehouse Construction: Forces Driving Architectural Evolution

Originally published May 2, 2013

Roughly speaking, defining the architecture of a data warehouse means deciding which functionality is placed on which of its components. To better understand your data warehouse today and to work out a meaningful picture of its tomorrow, it is helpful to review the evolutionary path that data warehouse architectures have generally followed in the past and to know the possibilities leading into the future. Here, it is particularly helpful to understand the major forces that drove this evolution.

In this article, we focus on how the placement and treatment of three major functionalities within data warehouses have evolved, namely extracting data (E), transforming data (T) and loading data (L). In most cases, this placement determines the effectiveness of the resulting data warehouse.

The Old Time: Manual ELT

In the good old days, when the term "data warehouse" was not yet established, the prevailing notion was the decision support system. At that time, there were no specific, professional tools for constructing such information platforms, on which the operational data was collected, integrated, stored, and analyzed. On these information platforms, the data was treated in a natural way:
  • Extract. The relevant operational data was extracted somehow from the operational applications and transferred either to a staging system in front of the information platform or to the information platform directly.

  • Load. This data was loaded into the database on the information platform using the load utilities provided by the corresponding database management system.

  • Transform. This data was then transformed with SQL programs (manually developed scripts, procedures, programs with pre-processors, etc.) and stored in the database on the information platform for the subsequent querying and analysis (a minimal sketch of this manual treatment follows after the remarks below).

This is referred to as the ELT workflow and is illustrated in Figure 1. There are two points worth mentioning:
  • Generally, the information platform is the database server, i.e., a computer system on which the data is stored in database tables and processed and managed by the underlying database management system.

  • The operation "load" refers solely to the initial filling of the database tables on the information platform with the data under consideration, no matter where it comes from: from files or from database tables on other systems.
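
To make this manual treatment concrete, here is a minimal sketch of how such an ELT run might have looked. All table and column names are invented for illustration, and an in-memory SQLite database merely stands in for whatever database management system actually hosted the information platform.

    import sqlite3

    # The "information platform": a database server; an in-memory SQLite
    # database stands in here for the actual DBMS.
    platform = sqlite3.connect(":memory:")

    # Extract: rows obtained somehow from an operational application
    # (shown inline here; in practice they typically arrived as flat files).
    extracted_rows = [(1, 19900, "de"), (2, 4550, "ch"), (3, 12000, "us")]

    # Load: the initial filling of a table on the information platform,
    # in practice done with the load utility of the DBMS.
    platform.execute(
        "CREATE TABLE stg_orders (order_id INTEGER, amount_cents INTEGER, country TEXT)"
    )
    platform.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", extracted_rows)

    # Transform: a manually written, set-based SQL program that runs
    # entirely on the information platform.
    platform.executescript("""
        CREATE TABLE fact_orders AS
        SELECT order_id,
               amount_cents / 100.0 AS amount,
               UPPER(country)       AS country_code
        FROM   stg_orders;
    """)
    platform.commit()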


Figure 1: The Architecture in the Old Time with Manual ELT

The main strength of this ELT architecture is its high performance, in particular in the transformation phase on the information platform. The following are the major reasons:
  • Special construction. It is typical that large data volumes are processed, stored and, in particular, later queried for analysis on information platforms. Thus, the database servers used for information platforms are usually designed and constructed so that they can process large volumes of data very efficiently. For this purpose, special architectures, data structures, algorithms and hardware are employed.

  • Generous configuration. As a rule, the hardware for each concrete information platform is generously configured, aiming at efficient query processing as well as effective transformation.

It should be pointed out that even today many data warehouses are still living in this "old time," mainly because of their high performance requirements. The disadvantages of this manual treatment show up especially in the development and administration of the programs:
  • The productivity of program development is generally low, and the quality of the programs is typically poor.

  • The administration of several hundred or even thousands of manually developed programs always represents a challenge.

  • Documentation of good quality usually leaves much to be desired.

The New Time: Tool-Aided ETL

Increasing the productivity and quality of a large number of programs, and facilitating their administration and documentation, were the driving forces behind the invention of numerous tools in recent decades, namely the data integration tools, as an instance of the so-called CASE (computer-aided software engineering) movement. Most of these tools had the following similarities:
  • Graphical user interfaces for program development were made available in the hope of increasing productivity and quality and facilitating the program administration and documentation.

  • The runtime systems of these tools worked, in principle, in a roundabout way (a sketch of this style follows after this list):

      • Extract the data from the data sources (i.e., flat files or database tables) and bring it to the ETL server (i.e., a separate computer system) on which the tool processed and transformed the data, instead of bringing it to the information platform directly.

      • Transform the data on the ETL server row by row.

      • Load the transformed data into the target database tables on the information platform for the subsequent querying and analysis.
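
The performance consequences of this style become visible even in a small sketch. The following hypothetical fragment imitates what the runtime of such a tool did: pull the source rows over to the ETL server, transform them one row at a time in the tool's own engine, and send the results row by row to the target table on the information platform. All names are invented, and in-memory SQLite databases stand in for the source and target systems.

    import sqlite3

    # Two separate systems: the data source and the information platform.
    source = sqlite3.connect(":memory:")    # stands in for an operational source
    platform = sqlite3.connect(":memory:")  # stands in for the database server

    source.execute(
        "CREATE TABLE orders (order_id INTEGER, amount_cents INTEGER, country TEXT)"
    )
    source.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(i, i * 100, "de") for i in range(1, 10001)],
    )
    platform.execute(
        "CREATE TABLE fact_orders (order_id INTEGER, amount REAL, country_code TEXT)"
    )

    # The ETL-server loop: every row crosses the system boundary twice and
    # is transformed individually in the tool's engine instead of in the
    # database, which is what makes this style slow for large data volumes.
    for order_id, amount_cents, country in source.execute("SELECT * FROM orders"):
        row = (order_id, amount_cents / 100.0, country.upper())
        platform.execute("INSERT INTO fact_orders VALUES (?, ?, ?)", row)
    platform.commit()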


Figure 2: The Architecture in the New Time with Tool-Aided ETL

This so-called ETL architecture, as illustrated in Figure 2, was conceived essentially in favor of the tool vendors: the complex and sophisticated functionality of the tools could be designed and developed independently of the diverse requirements of the many different information platform systems. In practice, this architecture had the following snags:
  • Weak configuration. In terms of processing capacity, the ETL server was typically configured much more weakly than the database server (for real-time data warehouses, the situation may be different), since it was usually not employed for the subsequent analysis.

  • Unsuitable technology. The "row by row" processing style typical of ETL servers is not sufficiently efficient, especially when large data volumes must be processed. Processing large data volumes is, however, characteristic of data warehouse practice, and processing them efficiently is therefore one of the main requirements for the professional database management systems hosting data warehouses.

  • Busy data transfer. The data to be processed was transferred in both directions across system boundaries over network connections. In many cases, these connections were not sufficiently powerful, which reduced the processing performance further.

All of this led to a performance issue. To overcome it, deleterious measures were taken in many relatively large data warehouses, measures that were, in principle, comparable to "cutting the toes to make the feet fit the shoes." One of the tricks often encountered was the so-called "side door": manually developed and tested complete SQL statements were copied into the pre- or post-processing areas of the tools. Such SQL statements then went directly from database to database without any essential contact with the ETL server. This had two significant effects:
  • No metadata about these SQL statements was stored in the data integration tools, so there was no valid or up-to-date documentation of the processing: an equally bad or even worse situation with regard to software documentation.

  • The manually developed and tested complete SQL statements had to be embedded into the tool environment and tested once again in the target context: additional, nontrivial work that further reduced productivity.

The Modern Time: Tool-Aided ELT

Efforts to overcome the performance challenge completely finally led, in recent years, to fundamental revisions of the existing data integration tools and to the invention of new ones. Many representative existing data integration tools, just like the new ones, have been moving in the following revised ELT direction, as shown in Figure 3:
  • Extract. The relevant operational data is extracted from the operational applications and transferred either to a staging system in front of the information platform or to the information platform directly, possibly under the control of the data integration tool.

  • Load. This data is loaded into the database on the information platform under the control of the data integration tool.

  • Transform. This data is then transformed on the information platform under the control of the data integration tool, and stored in the database on the information platform for the subsequent querying and analysis (a minimal sketch of this style follows below).
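
A hedged sketch of the difference from the ETL style shown earlier: the tool no longer pulls the rows out of the database; it only submits (typically generated) set-based SQL to the information platform and lets the database server do the work. The names are again invented for illustration, with SQLite standing in for the platform's DBMS.

    import sqlite3

    platform = sqlite3.connect(":memory:")  # the information platform's database server

    # Extract and Load: the staging table is filled under the tool's control.
    platform.execute(
        "CREATE TABLE stg_orders (order_id INTEGER, amount_cents INTEGER, country TEXT)"
    )
    platform.executemany(
        "INSERT INTO stg_orders VALUES (?, ?, ?)",
        [(i, i * 100, "de") for i in range(1, 10001)],
    )

    # Transform: the tool merely submits a set-based statement; the rows
    # themselves never leave the information platform.
    platform.executescript("""
        CREATE TABLE fact_orders AS
        SELECT order_id,
               amount_cents / 100.0 AS amount,
               UPPER(country)       AS country_code
        FROM   stg_orders;
    """)
    platform.commit()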

Figure 3: The Architecture in the Modern Time with Tool-Aided ELT

In this way, the advantages expected from graphical user interfaces appear to be realized, and the performance challenge has been overcome. However, the issues regarding productivity, documentation, software quality and administration are not really resolved, even in the case of completely new data warehouses.

It has to be pointed out that for existing data warehouses based on the "new time" ETL tools, the new generation of tools, i.e., the "modern time" ELT tools, cannot help very much. This is because the existing SQL statements hidden behind the "side door" cannot be automatically converted into the metadata required by the tools. The expense of manually reengineering all such hidden SQL statements amounts to more than half of the expense of constructing a completely new data warehouse of the same scale.

The Future Time: MGO-Based ELT

In my previous articles in this series, a new approach for constructing data warehouses, the metadata-driven generic operator (MGO) approach, was introduced in detail. Its workflow, as illustrated in Figure 4, is exactly the same as that of the "old time" architecture described at the beginning of this article, i.e., ELT, but without tools. Thus, it has the same and, notably, the best performance. However, it does not have any of the weaknesses of its ancestor. As detailed in the above-mentioned articles, it is even much better than the approaches supported by the modern and professional data integration tools on the market regarding all the critical aspects listed here:
  • It improves the productivity significantly.

  • It improves the software quality significantly.

  • It improves the documentation quality significantly.

  • It is significantly easier to administer the programs.

  • It does not require license fees.


Figure 4: The Architecture in the Future Time with MGO-Based ELT

As a matter of fact, this approach makes constructing data warehouses with a sophisticated architecture affordable, which, in turn, leads to high data quality and usability even with very unfavorable data sources. In this way, data warehouse construction is no longer a big issue, even if the warehouse is expected to be of any type described in Defining Data Warehouse Variants by Classification and to meet all the requirements enumerated in Refining the Enterprise Data Warehouse Definition by Requirements Categorization.

It is noteworthy that, from the architectural perspective, there is no essential evolutionary movement or improvement here. All the substantial improvements enumerated above are achieved by applying a completely new constructional approach, or paradigm, introduced in my article series mentioned above. In other words, as soon as the architecture is determined, the constructional approach is decisive for the real success of the undertaking. A detailed and extensive description of the MGO paradigm can be found in my book Constructing Data Warehouses with Metadata-driven Generic Operators.
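
The MGO operators themselves are described in the earlier articles and in the book; purely to illustrate the general idea of driving set-based, in-database processing from metadata instead of from hand-written or GUI-drawn programs, a hypothetical fragment might look like the following sketch. The metadata layout, the generator function and all names here are inventions for this illustration, not the actual MGO operators.

    import sqlite3

    # Hypothetical mapping metadata: one expression per target column over a
    # staging table. In a metadata-driven setting, such entries serve as both
    # the specification and the documentation of the transformation.
    mapping = {
        "target": "fact_orders",
        "source": "stg_orders",
        "columns": {
            "order_id": "order_id",
            "amount": "amount_cents / 100.0",
            "country_code": "UPPER(country)",
        },
    }

    def generic_transform_operator(db, meta):
        # A generic operator: it reads the metadata, generates the set-based
        # SQL, and lets the information platform execute it (ELT, no ETL server).
        select_list = ", ".join(
            f"{expr} AS {col}" for col, expr in meta["columns"].items()
        )
        db.executescript(
            f"CREATE TABLE {meta['target']} AS SELECT {select_list} FROM {meta['source']};"
        )

    # Minimal demonstration on an in-memory platform.
    platform = sqlite3.connect(":memory:")
    platform.execute(
        "CREATE TABLE stg_orders (order_id INTEGER, amount_cents INTEGER, country TEXT)"
    )
    platform.executemany(
        "INSERT INTO stg_orders VALUES (?, ?, ?)", [(1, 19900, "de"), (2, 4550, "ch")]
    )
    generic_transform_operator(platform, mapping)
    platform.commit()

In such a setting, the metadata rather than a hand-written program carries the transformation logic, which is why the documentation and the generated processing stay consistent with each other.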

A Pictorial Summary

Figure 5 summarizes the four evolutionary phases of data warehouse architecture discussed in this article along with the five driving forces. As a matter of fact, the validity of the results of this discussion is not limited to data warehouses; they also apply to so-called enterprise data integration in general.

 

Figure 5: Four Evolutional Phases of Data Warehouse Architecture (B. Jiang, 2011)

A Stock Market Recommendation

Go short on the data integration tool vendors if they do not change their constructional paradigms, even in the context of so-called "big data." (Note that it may take a relatively long time to see this take effect.)

  • Bin Jiang, Ph.D.
    Dr. Bin Jiang received his master’s degree in Computer Science from the University of Dortmund / Germany in 1986. In 1992, he received his doctorate in Computer Science from ETH Zurich / Switzerland. During this research period, two of his publications in the field of database management systems received best student paper awards at the IEEE Conference on Data Engineering, in 1990 and 1992.

    Afterward, he worked for several major Swiss banks, insurance companies, retailers, and with one of the largest international data warehousing consulting firms as a system engineer, software developer, and application analyst in the early years, and then as a senior data warehouse consultant and architect for almost twenty years.

    Dr. Bin Jiang is a Distinguished Professor at a large university in China, and the author of the book Constructing Data Warehouses with Metadata-driven Generic Operators (DBJ Publishing, July 2011), which Dr. Claudia Imhoff called “a significant feat” and for which Bill Inmon provided a remarkable foreword. Dr. Jiang can be reached by email at bin.jiang@bluewin.ch.

    Editor's Note: You can find more articles from Dr. Bin Jiang and a link to his blog in his BeyeNETWORK expert channel, Data Warehouse Realization.
