In my article, Data Warehouse Construction: How Would Great Engineers Have Done It?, I pointed out that it is characteristic of data warehouses that their extract, transform and load (ETL) mechanism comprises a set of fairly similar but not identical programs. There, I also emphasized that if we can construct the ETL mechanism effectively, we have in principle settled the data warehouse construction challenge as well. In order to find an effective way of constructing data warehouses, let us first analyze the typical programming process during ETL mechanism construction.
A Lab Task
To keep the discussion understandable, let us consider a rudimentary task that every data warehouse constructor must confront almost every day, for instance, in preparation for receiving newly extracted source data. Although this task is trivial, it contains all the essential conceptual elements that we will investigate thoroughly later.
Assume that we have a database "my_db" containing 1000 tables, e.g., "tab_adfhkfha," "tab_lkeasr," … and "tab_dajpgh." We want to empty all these tables. We know that SQL (Structured Query Language) statements according to the syntax "DELETE FROM <database_name>.<table_name>;" are appropriate for this task. Now, how would you accomplish this task? One of the possibilities is writing down the following 1000 fairly similar but not identical statements and executing them afterwards:
DELETE FROM my_db.tab_adfhkfha;
DELETE FROM my_db.tab_lkeasr;
…
DELETE FROM my_db.tab_dajpgh;
When doing this, you might "copy" the first written statement and "paste" it 999 times. Then, you "adjust" certain parts of each of these pasted statements, like the table names, individually. Of course, you have to "verify" afterwards whether the adjustments are correct. The question is now, how much time would you need to empty these tables completely? If you are not very lucky, it may take you several tedious hours. Please notice each quoted verb here.
Now, let us have a closer look at the above process. The first thing that stands out should be the repetitions. There are two kinds of repetition here:
- Contents: The phrase "DELETE FROM my_db." of the complete statements
- Operations: The editing operations required to produce the subsequent 999 DELETE statements, i.e., "copy (the first statement)-paste (it 999 times)-search (the table names 999 times)-replace (the old table name with the new one 999 times)-adjust (if necessary 999 times)-verify (whether the resulting statement is correct 999 times)."
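The operations chain above can be mimicked mechanically. The following Python sketch is illustrative only, using two of the hypothetical table names as stand-ins for the full set of 1000:

```python
# A minimal sketch of the editing chain, applied mechanically.
# "copy & paste": duplicate the first statement ...
original = "DELETE FROM my_db.tab_adfhkfha;"
pasted = original

# ... then "search & replace" the object-specific part, the table name.
new_statement = pasted.replace("tab_adfhkfha", "tab_lkeasr")

# "verify": check that the result is the intended statement.
assert new_statement == "DELETE FROM my_db.tab_lkeasr;"
```

Note that the "adjust" step has no counterpart here: it is exactly the part that resists mechanization.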
The Contents Repetition
Why is there the first kind of repetition? It is because the requirement for the mentioned phrase is generic for all these DELETE statements. It must look like this due to the SQL syntax, and perhaps due to a programming style convention: all SQL reserved words must be written in capital letters. In any other form, the statements may not work well or might look awful to the developer team involved. We consider such repeated contents to be the carriers of the "domain-generic knowledge." With domain, we mean a limited activity area under consideration.
The opposite question is now what does not repeat. In our emptying example, it is the table names that do not repeat at all. Each statement contains its unique specific table name. We call such things the carriers of the "object-specific knowledge." In our case, a table is an object.
In these terms, we put the point regarding the first kind of repetition in short: the carriers of the domain-generic knowledge repeat, while those of the object-specific knowledge do not.
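This split between the two kinds of knowledge can be made explicit in a few lines of code. In the sketch below, three hypothetical table names stand in for the full set of 1000, which in practice would be read from the database catalog:

```python
# Object-specific knowledge: the table names, kept as plain data.
# (Three hypothetical names stand in for the full set of 1000.)
tables = ["tab_adfhkfha", "tab_lkeasr", "tab_dajpgh"]

# Domain-generic knowledge: the statement skeleton, written exactly once.
TEMPLATE = "DELETE FROM my_db.{table};"

# The carrier of generic knowledge repeats; the specific carriers do not.
statements = [TEMPLATE.format(table=t) for t in tables]
for s in statements:
    print(s)
```

The template is written once and repeated by the machine, while the specific table names appear exactly once each, as data.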
As a matter of fact, Ford's assembly-line approach is based on the observation that many components of automobiles are correspondingly identical – in our terms, they are the respective carriers of certain generic knowledge – and thus can simply be "copied & pasted," i.e., repetitively and identically produced. With his assembly-line approach, Ford achieved a revolutionarily high constructional effectiveness. On the other hand, he did not have to handle object-specific knowledge. Such knowledge is, however, typical of the ETL mechanism, due to its characteristic as noted in the first sentence of this article. Therefore, his approach cannot be transferred directly to our data warehouse construction.
The Operations Repetition
The second kind of repetition is the well-used editing operations chain, or the most frequently encountered modern behavior pattern: "copy-paste-search-replace-adjust-verify."
"Copy & paste" anything using computer editors may be one of the greatest and most powerful inventions with regard to increasing productivity since printing was invented. We data warehouse constructors take full advantage of it to produce ETL programs fast by exploiting their characteristics. (If we do not manage to do this, we might have already lost the war before we begin it.) On the other hand, no two ETL programs to be developed should be identical, even if they are generally fairly similar. Therefore, we have to "search" – another great and powerful invention, together with "replace" – for the places where they should contain other content, and "replace" this old content with the new one automatically. It is usually necessary to manually "adjust (modify / change / optimize / etc.)" the new program somehow afterwards if the involved programs do not seem to be sufficiently similar. To "verify" the correctness of the results is a must in all cases.
Why do we need "copy & paste?" It is because there is a considerable portion that the old and the new programs should share due to their significant similarity. In other words, there are some "fat" things generic to both. If the common portions were not "fat" enough, no "copy & paste" would be applied. Why can we "search & replace" automatically? It is because we know exactly what should differentiate them from each other. In other words, we know that there is something specific to each individually, but on a regular basis. Why do we have to "adjust" the resulting program manually? It is because there may be something else also specific that is not so easy to treat simply by "search & replace." If we are not able to perceive or identify a significant exploitable similarity between both programs, neither "copy & paste" nor "search & replace" will be applied to the construction of the new program. In this case, the long operations chain degenerates to a short one, i.e., "adjust & verify." Put another way, we would construct our ETL programs with a classical typewriter, as Ford would have done 100 years ago, or with a fine nib, as Archimedes would have done 2000 years ago. Why is it necessary to "verify" each program individually? It is because "search & replace" may do a little more than desired, and any manual operation may induce mistakes. I would speculate that Henry Ford and his design team had also carried out a similar analysis of the production behaviors of their workers before they designed their assembly lines.
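"Search & replace" doing a little more than desired is easy to reproduce. In the hypothetical snippet below, one table name is a prefix of another, so a naive textual replace silently damages a statement it was never meant to touch:

```python
# Two statements; "tab_a" is a prefix of "tab_ab" (hypothetical names).
script = "DELETE FROM my_db.tab_a;\nDELETE FROM my_db.tab_ab;"

# Naively replace the table name "tab_a" with "tab_x" ...
replaced = script.replace("tab_a", "tab_x")

# ... and the second statement is damaged as well:
print(replaced)
# DELETE FROM my_db.tab_x;
# DELETE FROM my_db.tab_xb;
```

This is exactly why "verify" remains a must even when the replacement itself runs automatically.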
There are some relationships here:
- The operations repetition is caused or motivated by the contents repetition because of the existence of some significant and exploitable similarity, commonality or genericity.
- All four subsequent operations in the operations repetition are induced by its first two operations – "copy & paste" – due to the existence of something non-identical, i.e., different or specific.
To put it in short: the fact that some "fat" generic things exist is decisive for this treatment behavior pattern, and many such things do exist in ETL programs.
More on the Operations
The operations pair "copy & paste" is extremely powerful, and "paste" is the greatest one as long as its meaningful applicability can be identified. Ford took excellent advantage of it for automobile construction, which eventually led to revolutionary industrial mass production. On the other hand, there are some requirements restricting its general effectiveness. The first one is that the copies must not be individually changed later. If that is not the case, a cheap, convenient industrial affair will turn into an expensive, troublesome artisan one. That is the reason why vendors honor the warranty only if the cameras or laptops they delivered have not been opened or changed by the customers themselves. The other issue is that the original, or the design, must be perfect. If it contains any errors, flaws or the like, the great "paste" distributes them uncompromisingly. The more frequently the "copied" portion is "pasted," the more broadly such flaws are distributed. Occasionally, we experience recalls of "pasted" cars of type X because of some flaws in their "original," i.e., in the design. The catastrophe occurs when some of the distributed flaws have been changed or "corrected" in an uncontrolled manner, i.e., by the customers individually.
If a copied common portion, i.e., a carrier of certain generic knowledge (genericity, in short), has been changed individually through the so-called "adjust," it is not a copy of the original anymore. It becomes, or at least appears to be, a carrier of some new specific knowledge (specificity, in short). The more frequently "adjust" is applied, the more specificity exists in your ETL program set. The more specificity exists in your ETL program set, the more your ETL programs will look like a set of artisan crafts. Artisan crafts are expensive in general, and their maintenance is troublesome. In this sense, we say that "adjust" is noxious. At the end of the day, you cannot distinguish what was generic and what was not, since everything is mixed, although at the beginning, your ETL program set might have been produced in an industrial manner by "copy & paste" (like Ford with his automobiles) and "search & replace" (something Ford could not do, because he had no corresponding operations pair). During the day, it is a set of fine artisan crafts, but at night it may be the subject of a Mary Shelley nightmare.
The above behavior analysis was carried out in a lab-like, isolated and simplified context. In my next article, I will use the insights obtained here to observe data warehouse construction in real life. All this analysis will help us develop an effective approach for data warehouse construction.
SOURCE: Data Warehouse Construction: Behavior Pattern Analysis