Oops! The input is malformed! Data Warehouse Construction: How Would Great Engineers Have Done It? by Bin Jiang, Ph.D. - BeyeNETWORK
We use cookies and other similar technologies (Cookies) to enhance your experience and to provide you with relevant content and ads. By using our website, you are agreeing to the use of Cookies. You can change your settings at any time. Cookie Policy.

Channel: Data Warehouse Realization - Bin Jiang RSS Feed for Data Warehouse Realization - Bin Jiang


Data Warehouse Construction: How Would Great Engineers Have Done It?

Originally published August 23, 2012

Assume that the "single version of the definition" for data warehouses has been settled with my article series Start Small: Single Version of the Definition. Since thus far I have not received any comments or objections to the proposed definition, I will assume that "peace, justice and prosperity," concerned by Confucius, have been defended more or less in this way. This seems to be a good start. On the other hand, it can be a fairly long way from a great idea to its effective realization. Perhaps for this reason, it is necessary for "eight philosophers to confer about how Nietzsche would have done it before they change a light bulb." Obviously, they believe in the wisdom that "sharpening the saw prior to sawing the wood will lead to easier and faster work." Now, let us first sharpen our saws by guessing how two of the greatest engineers in history would have constructed data warehouses.

A Short Introduction to the ETL Mechanism of Data Warehouses

The ETL (extract, transform, load) mechanism of a data warehouse is a set of programs for extracting data from the source data sets, transforming this data and loading the transformed data into the target tables in the data warehouse for subsequent analysis. It is the key organ of a data warehouse for meeting his three major functional requirements enumerated in my article Refining the Enterprise Data Warehouse Definition by Requirements Categorization. Besides, its importance for a running data warehouse is comparable with that of the cardiovascular system for any living member of the kingdom Animalia. The design, development, test, maintenance and extension of this mechanism represent the main challenge for the data warehouse construction, especially in a fast-changing world. In this sense, we can say that if we can effectively construct the ETL mechanism, we have, in principle, settled the data warehouse construction challenge as well. In other words, both challenges are essentially equal.

In the past three decades, almost everything imaginable has been tried, including different methodologies, architectures, models, tools, aids, etc. However, the achievement still appears somewhat unsatisfactory: The construction remains financially expensive, operatively time-consuming and politically risky. This has a psychologically braking effect on the organizations employing the data warehouse technology for improving their performance. They cannot think big enough because they know and have experienced that they cannot effectively do it big enough.

However, the ETL mechanism as a program set can always be observed in two clear ways:
  • Horizontal. Each program in the set is responsible for a single target table with corresponding extraction, transformation and load steps, just like its name: E-T-L.

  • Vertical. The whole program set consists of three subsets. The first one is for extraction, the second one is for transformation and the third one is for load (again, just like its name: E-T-L).
In other words, for all target tables fairly similar but not identical  programs need to be constructed. This is characteristic for the ETL mechanism and, thus, for the data warehouse construction.

Archimedes and Henry Ford Constructing the ETL Mechanism 

Now, let us consider how Archimedes of Syracuse (c. 287 BC – 212 BC), the greatest engineer of antiquity, and Henry Ford, a pioneer automobile engineer and sponsor of the assembly line production technique, would have constructed the ETL mechanism for 1000 target tables. Assume that the task must be accomplished within a very short time interval (e.g., one day) and, on the other side, that there are an unlimited number of program developers available.
During the Siege of Syracuse (c. 214–212 BC), Archimedes tried to destroy enemy ships with fire using the "Archimedes heat ray," consisting of hundreds of highly polished bronze or copper shields acting as mirrors. It was used to focus sunlight onto approaching ships, causing them to catch fire, as illustrated by Figure 1.

Figure 1: An Illustration of Archimedes' Heat Ray

Archimedes would probably have considered the ETL program for each target table as a mirror shield, and thus divided the available program developers into small developer teams. Each of these teams would have been assigned to producing a small number of completely functioning ETL programs, each for a target table, according to its succinct functional specification: E, T and L, together with a tiny and pragmatist footnote: "White cat, black cat, whichever catches mice is a good cat."

Henry Ford (1863 –1947) might have taken another way, the way of his automobile production, and applied his assembly line approach, as shown in Figure 2. He would have divided the entire task of ETL into smaller subtasks, like E, T and L, considered these as work step types, and assigned each of these step types to a single developer team. In other words, each developer team would have treated only one small member of the entire work step chain, but for all target tables.
In short, Archimedes would have partitioned the task horizontally, while Ford would have done this vertically.

Each of the Archimedes’ program developers would have had an overview of their own ETL programs and would have been a generalist. Moreover, he would have had fun with his work, since the programs are not required to be identical and the succinct design specification can and will be inevitably interpreted individually, creatively and thus differently. As a consequence, the program production would have been expensive, time-consuming and, thus, ineffective; and the developers could not have been easily exchanged between the teams because the programs there would have looked quite different.

Because all programs they would have produced in a team would have been almost identical, Ford’s developers would have been highly trained specialists for their assigned small tasks. As a consequence, they would have delivered high quality results efficiently, i.e., they could have worked extremely effectively. On the other hand, they would not really have been aware of the big picture. Therefore, their work would have been monotone and boring.

Archimedes was ingenious at designing his weapons and is regarded as one of the greatest engineering designers of all time. In principle, however, he was merely a great artisan, similar to Leonardo da Vinci (1452 – 1519). In contrast, Ford was smart, not only at designing automobiles, but also at the design of the whole production system for automobiles. He was a great meta-designer, who began the assembly line industry epoch. Archimedes had fun, whereas Ford had effectiveness, i.e., high productivity with high quality and low costs.

For real life data warehouses, the ETL mechanism is not as straightforward as described by my above scenarios. We have identified more than two dozen functional task types for the ETL mechanism. Some of them are fairly sophisticated and challenging. Moreover, their applicability is strongly dependent on the data source situation and on the requirements on the data warehouse in consideration, which can be complex and challenging as well. Even for a single task type there may be several variants.
Working generally in the Archimedes' manner, we had lots of fun constructing data warehouses due to the interpretation freedom in the past three decades. Quite frequently, however, our fun turned into a Mary Shelley nightmare. In other words, Archimedes’ way has proven to be the wrong one for us. Ford did not present any freedom of interpretation to his employees. His approach is based on the simple observation that the works assigned to each team can be accomplished absolutely identically. This is also the major design principle of his assembly lines. However, the ETL programs are not identical, although they can be fairly similar. In short, Ford’s way is not the right one for us either. Does an effective and applicable way exist for us? In my next article, I will analyze what does happen when we are doing ETL and give an answer to this question.

Gong Cheng Shi Constructing … Houses

There is some noteworthy "big data" to consider regarding the latest advancements around constructional techniques by Chinese engineers (Gong Cheng Shi). Do you see any similarity between Ford's assembly line technique and the constructional technique that Gong Cheng Shi applied to construct a 30-story hotel in just 15 days (Figure 3)? Can we construct data warehouses at a similar velocity and with a comparable quality?

Figure 3: The 30-story Ark Hotel in Chiangsa, China was built in just 15 days

References :

SOURCE: Data Warehouse Construction: How Would Great Engineers Have Done It?

  • Bin Jiang, Ph.D.Bin Jiang, Ph.D.
    Dr. Bin Jiang received his master’s degree in Computer Science from the University of Dortmund / Germany in 1986. In 1992, he received his doctorate in Computer Science from ETH Zurich / Switzerland. During the research period, two of his publications in the field of database management systems were awarded as the best student papers at the IEEE Conference on Data Engineering in 1990 and 1992.

    Afterward, he worked for several major Swiss banks, insurance companies, retailers, and with one of the largest international data warehousing consulting firms as a system engineer, software developer, and application analyst in the early years, and then as a senior data warehouse consultant and architect for almost twenty years.

    Dr. Bin Jiang is a Distinguished Professor of a large university in China, and the author of the book Constructing Data Warehouses with Metadata-driven Generic Operators (DBJ Publishing, July 2011), which Dr. Claudia Imhoff called “a significant feat” and for which Bill Inmon provided a remarkable foreword. Dr. Jiang can be reached by email at bin.jiang@bluewin.ch

    Editor's Note: You can find more articles from Dr. Bin Jiang and a link to his blog in his BeyeNETWORK expert channel, Data Warehouse Realization.

Recent articles by Bin Jiang, Ph.D.



Want to post a comment? Login or become a member today!

Be the first to comment!