
Data Warehouse Construction: Behavior Pattern Analysis

Originally published September 13, 2012

In my article, Data Warehouse Construction: How Would Great Engineers Have Done It?, I pointed out that it is characteristic of data warehouses that their extract, transform and load (ETL) mechanism comprises a set of fairly similar but not identical programs. There, I also emphasized that if we can construct the ETL mechanism effectively, we have in principle settled the data warehouse construction challenge as well. In order to find an effective way of constructing data warehouses, let us first analyze the typical programming process during ETL mechanism construction.

A Lab Task

To keep the discussion understandable, let us consider a rudimentary task that every data warehouse constructor must confront almost every day, for instance, in preparation for receiving newly extracted source data. Although this task is trivial, it contains all the essential conceptual elements that we will investigate thoroughly later.

Assume that we have a database "my_db" containing 1000 tables, e.g., "tab_adfhkfha," "tab_lkeasr," … and "tab_dajpgh." We want to empty all these tables. We know that SQL (Structured Query Language) statements of the form "DELETE FROM <databasename>.<tablename>;" are appropriate for this task. Now, how would you accomplish it? One possibility is writing down the following 1000 fairly similar but not identical statements and executing them afterwards:
DELETE FROM my_db.tab_adfhkfha;
DELETE FROM my_db.tab_lkeasr;
...
DELETE FROM my_db.tab_dajpgh;
When doing this, you might copy the first written statement and paste it 999 times. Then, you adjust certain parts of each of these pasted statements, such as the table names, individually. Of course, you have to verify afterwards whether the adjustments are correct. The question is now: how much time would you need to empty these tables completely? If you are not very lucky, it may take you several tedious hours. Please note the verbs here: copy, paste, adjust, and verify.
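For contrast with the manual editing described above, the same 1000 statements can be produced mechanically from the table names alone. A minimal Python sketch; the three names are the illustrative placeholders from the example, standing in for the full list of 1000:

```python
# Generate the DELETE statements mechanically instead of copying and
# pasting by hand. The names below are illustrative placeholders; a real
# run would use the complete list of 1000 table names.
table_names = ["tab_adfhkfha", "tab_lkeasr", "tab_dajpgh"]

statements = [f"DELETE FROM my_db.{name};" for name in table_names]

for stmt in statements:
    print(stmt)
```

Note how the repeated phrase appears exactly once here, while only the table names vary.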

Now, let us have a close look at the above process. The first thing standing out should be the repetitions. There are two kinds of repetitions here:
  1. Contents: The phrase "DELETE FROM my_db." of the complete statements

  2. Operations: The editing operations required to produce the subsequent 999 DELETE statements, i.e., "copy (the first statement)-paste (it 999 times)-search (the table names 999 times)-replace (the old table name with the new one 999 times)-adjust (if necessary 999 times)-verify (whether the resulting statement is correct 999 times)."

The Contents Repetition

Why is there the first kind of repetition? It is because the requirement for the mentioned phrase is generic for all these DELETE statements. It must look like this due to the SQL syntax, and perhaps due to a programming style convention: all SQL reserved words must be written in uppercase. In any other form, the statements may not work or might look awful to the developer team involved. We consider such repeated contents to be the carriers of the "domain-generic knowledge." By domain we mean a limited activity area under consideration.

The opposite question is now: what does not repeat? In our emptying example, it is the table names that do not repeat at all. Each statement contains its own specific table name. We call such things the carriers of the "object-specific knowledge." In our case, a table is an object.

In these terms, we put the point regarding the first kind of repetition in short: The carriers of the domain-generic knowledge repeat, while those of the object-specific knowledge do not.
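This split can even be exploited directly: the object-specific knowledge (the table names) is stored in the database's own metadata, so the generic statement pattern can be instantiated without writing any name down by hand. A sketch using SQLite's catalog table sqlite_master; other database systems expose the same information through, e.g., information_schema (the table names here are illustrative placeholders):

```python
import sqlite3

# An in-memory database standing in for "my_db"; the three tables are
# illustrative placeholders.
conn = sqlite3.connect(":memory:")
for name in ("tab_adfhkfha", "tab_lkeasr", "tab_dajpgh"):
    conn.execute(f"CREATE TABLE {name} (id INTEGER)")
    conn.execute(f"INSERT INTO {name} VALUES (1)")

# Object-specific knowledge: read the table names from the catalog.
names = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]

# Domain-generic knowledge: one statement pattern, applied to each name.
for name in names:
    conn.execute(f"DELETE FROM {name}")

# Every table is now empty.
counts = [conn.execute(f"SELECT COUNT(*) FROM {n}").fetchone()[0]
          for n in names]
```

The generic part is written exactly once; the specific part is never written at all, only read from metadata.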

As a matter of fact, Ford's assembly-line approach is based on the observation that many components of automobiles are correspondingly identical – in our terms, they are respective carriers of certain generic knowledge – and thus can simply be "copied & pasted," i.e., repetitively and identically produced. With his assembly-line approach, Ford achieved a revolutionary level of constructional effectiveness. On the other hand, he did not have to handle object-specific knowledge. Such knowledge is, however, typical of the ETL mechanism, given its characteristic noted in the first sentence of this article. Therefore, his approach cannot be transferred directly to our data warehouse construction.

The Operations Repetition

The second kind of repetition is the well-used editing operations chain, or the most frequently encountered modern behavior pattern: "copy-paste-search-replace-adjust-verify."

"Copy & paste" in computer editors may be one of the greatest and most powerful inventions with regard to increasing productivity since the invention of printing. We data warehouse constructors take full advantage of it to produce ETL programs fast by exploiting their similarity. (If we do not manage to do this, we have already lost the war before it begins.) On the other hand, no two ETL programs to be developed should be identical, even if they are generally fairly similar. Therefore, we have to "search" – together with "replace," another great and powerful invention – for the places where a program should contain other content, and "replace" the old content with the new one automatically. It is usually necessary to manually "adjust (modify / change / optimize / etc.)" the new program somehow afterwards if the involved programs are not sufficiently similar. To "verify" the correctness of the results is a must in all cases.

Why do we need "copy & paste?" It is because there is a considerable portion that the old and the new programs share due to their significant similarity. In other words, there are some "fat" things generic to both. If the common portions were not "fat" enough, "copy & paste" would not be applied. Why can we "search & replace" automatically? It is because we know exactly what should differentiate the programs from each other. In other words, we know that there is something specific to each individually, but on a regular basis. Why do we have to "adjust" the resulting program manually? It is because there may be something else specific that is not so easy to treat simply by "search & replace." If we are not able to perceive or identify a significant exploitable similarity between the two programs, neither "copy & paste" nor "search & replace" will be applied to the construction of the new program. In this case, the long operations chain degenerates to a short one, i.e., "adjust & verify." Put another way, we would construct our ETL programs with a classical typewriter, as Ford would have done 100 years ago, or with a fine nib, as Archimedes would have done 2000 years ago. Why is it necessary to "verify" each program individually? It is because "search & replace" may do a little more than desired, and any manual operation may introduce mistakes. I would speculate that Henry Ford and his design team carried out a similar analysis of their workers' production behaviors before they designed their assembly lines.
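The chain itself can be written down as operations on text. A sketch of "copy-paste-search-replace-verify" in Python, with a regular-expression check standing in for the final "verify" step (the pattern and the table names are illustrative):

```python
import re

# "Copy" the first, manually written statement.
original = "DELETE FROM my_db.tab_adfhkfha;"

# "Paste" it once per remaining table, then "search & replace" the
# object-specific part (the table name). Two names stand in for 999.
new_names = ["tab_lkeasr", "tab_dajpgh"]
statements = [original]
for name in new_names:
    statements.append(original.replace("tab_adfhkfha", name))

# "Verify": every result must still match the generic pattern exactly,
# because "search & replace" may do a little more than desired.
pattern = re.compile(r"DELETE FROM my_db\.\w+;\Z")
ok = all(pattern.match(s) for s in statements)
```

The "adjust" step is deliberately absent here; as discussed below, it is precisely the step that resists mechanization.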

There are some relationships here:
  • The operations repetition is caused or motivated by the contents repetition because of the existence of some significant and exploitable similarity, commonality or genericity.

  • All four subsequent operations in the operations repetition are induced by its first two operations – "copy & paste" – due to the existence of something non-identical, i.e., different or specific.
In short, the fact that some "fat" generic things exist is decisive for this behavior pattern, and many such things do exist in ETL programs.

More on the Operations

The operations pair "copy & paste" is extremely powerful, and "paste" is the greatest part of it as long as its meaningful applicability can be identified. Ford took excellent advantage of it for automobile construction, which eventually helped usher in mass production. On the other hand, some requirements restrict its general effectiveness. The first is that the copies must not be individually changed later. If that is not the case, a cheap, convenient industrial affair turns into an expensive, troublesome artisan one. That is the reason why vendors honor the warranty only if the cameras or laptops they delivered have not been opened or changed by the customers themselves. The other issue is that the original, or the design, must be perfect. If it contains any errors, flaws or the like, the great "paste" distributes them uncompromisingly. The more frequently the "copied" portion is "pasted," the more broadly such flaws are distributed. Occasionally, we experience recalls of "pasted" cars of type X because of some flaw in their "original," i.e., in the design. The catastrophe occurs when some of the distributed flaws have been changed or "corrected" in an uncontrolled manner, i.e., by the customers individually.

If a copied common portion, i.e., a carrier of certain generic knowledge (genericity in short), has been changed individually through the so-called "adjust," it is not a copy of the original anymore. It becomes, or at least appears to be, a carrier of some new specific knowledge (specificity in short). The more frequently "adjust" is applied, the more specificity exists in your ETL program set. The more specificity exists in your ETL program set, the more your ETL programs will look like a set of artisan crafts. Artisan crafts are expensive in general, and their maintenance is troublesome. In this sense, we say that "adjust" is noxious. At the end of the day, you cannot distinguish what was generic and what was not, since everything is mixed – even though at the beginning, your ETL program set might have been produced in an industrial manner by "copy & paste" (like Ford with his automobiles) and "search & replace" (something Ford could not do because he had no corresponding operations pair). During the day, it is a set of fine artisan crafts, but at night it may be the subject of a Mary Shelley nightmare.

The above behavior analysis was carried out in a lab-like isolated and simplified context. In my next article, I will use the insights obtained here to observe the data warehouse construction in real life. All this analysis will help us in developing an effective approach for data warehouse construction.

  • Bin Jiang, Ph.D.
    Dr. Bin Jiang received his master's degree in Computer Science from the University of Dortmund, Germany, in 1986. In 1992, he received his doctorate in Computer Science from ETH Zurich, Switzerland. During his research period, two of his publications in the field of database management systems were awarded best student paper at the IEEE Conference on Data Engineering, in 1990 and 1992.

    Afterward, he worked for several major Swiss banks, insurance companies, retailers, and with one of the largest international data warehousing consulting firms as a system engineer, software developer, and application analyst in the early years, and then as a senior data warehouse consultant and architect for almost twenty years.

    Dr. Bin Jiang is a Distinguished Professor of a large university in China, and the author of the book Constructing Data Warehouses with Metadata-driven Generic Operators (DBJ Publishing, July 2011), which Dr. Claudia Imhoff called “a significant feat” and for which Bill Inmon provided a remarkable foreword. Dr. Jiang can be reached by email at bin.jiang@bluewin.ch

    Editor's Note: You can find more articles from Dr. Bin Jiang and a link to his blog in his BeyeNETWORK expert channel, Data Warehouse Realization.
