Data Warehouse Construction: The Real Life

Originally published October 11, 2012

In my article Data Warehouse Construction: Behavior Patterns Analysis, the typical behavior patterns for constructing the extract, transform and load (ETL) mechanism were analyzed in a lab-like, isolated and strongly simplified context in order to make the essential aspects clear and comprehensible. In this article, the insights obtained there are applied to data warehouse construction in real life.

If we had only a small number of individual, manually made ETL programs to care about, as with typical operational applications, our lives would be quite pleasant. In an average organization, however, there may be dozens of source applications for the enterprise data warehouse, and each of them may deliver up to hundreds of source tables. Consequently, the entire ETL mechanism may consist of thousands of ETL programs, some of which can be quite sophisticated. This is another major characteristic of the ETL mechanism: the quantitative one. Together with the qualitative characteristic pointed out in the article mentioned above, it says that the ETL mechanism of an average enterprise data warehouse usually consists of a large number of fairly similar but not identical programs.
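To make this concrete, here is a minimal sketch (in Python, with hypothetical table and column names) of two such hand-made ETL programs. Real programs would be considerably more involved; the point is their shape: they load different source tables, yet differ only in names and a few transformation details.

    # Hypothetical sketch: two hand-made ETL load programs. They are
    # fairly similar but not identical -- only the table, column and
    # transformation details differ.
    import sqlite3  # stand-in for any database connection

    def load_customers(conn: sqlite3.Connection) -> None:
        """Extract, transform and load the customer source table."""
        rows = conn.execute("SELECT cust_id, name, country FROM src_customer")
        for cust_id, name, country in rows:
            conn.execute(
                "INSERT INTO dwh_customer (cust_id, name, country) VALUES (?, ?, ?)",
                (cust_id, name.strip().upper(), country),  # transform: normalize names
            )
        conn.commit()

    def load_orders(conn: sqlite3.Connection) -> None:
        """Extract, transform and load the order source table."""
        rows = conn.execute("SELECT order_id, cust_id, amount FROM src_order")
        for order_id, cust_id, amount in rows:
            conn.execute(
                "INSERT INTO dwh_order (order_id, cust_id, amount) VALUES (?, ?, ?)",
                (order_id, cust_id, round(amount, 2)),  # transform: round amounts
            )
        conn.commit()

Multiply this pattern by hundreds of source tables per application and dozens of source applications, and the quantitative characteristic becomes palpable.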

While the significant similarity can be exploited to increase constructional productivity, the sheer number of programs turns data warehouse construction into a real challenge. This is because, in addition to the troublesome programming (scripting, coding, "modeling," etc.), all of these programs have to be designed, specified, tested, documented and maintained as well. Such a huge amount of work cannot be accomplished by a single person within a short period. Instead, large design and development teams working over a long period are usually necessary.

A Large Team …  

Employing a large number of developers brings challenges to the data warehouse construction process. These developers have different backgrounds and levels of understanding and are accustomed to different ways of thinking and working; they master the development tools and aids at different levels and command the database management systems and SQL (Structured Query Language) to different degrees. As a consequence, the programs delivered by one developer will not always be directly understandable and, thus, maintainable by others, even if they are made according to the same or similar design specifications and run perfectly in production. There is simply too much personality immersed in the delivered programs. In other words, the generic knowledge, mostly implied in the design specifications, can be interpreted by means of programming in different fashions by different persons, all of which may nevertheless be correct. In this manner, the genericity that actually exists becomes hard to recognize and turns, in perception, into a kind of pseudo-specificity. More concretely, the carrier of certain generic knowledge, e.g., a portion of a program made by one developer, appears to others as a carrier of some specific knowledge, and consequently they treat it as such.

A Long Period …  

A long period means that even if the programs are developed by the same developer but at different points in time, that developer's applied layouts, implied styles, employed skills and immersed personal habits change, albeit slowly. Quite often, a developer has difficulty understanding programs he developed two years ago. Last but not least, we often observe that standards, conventions or guidelines are forgotten or mistaken because of the effect of time on our memory. In short, a developer today is not the same as he was two years ago, as philosophers noted several thousand years ago, and neither are his "interpretations" of the same design specification. In this sense, if we in fact have 10 developers in a team confronting design specifications for two years, we have in effect 20 or 30 different developers delivering interpretations. This is a noteworthy qualitative dimensional transformation: from a value of the time dimension to a value of the material/personnel dimension.

As a matter of fact, the time aspect affects much more than just this. As time passes:
  • The business requirements and technological circumstances change;

  • The involved stakeholders (sponsors, managers, architects, designers, developers, testers, etc.) come and go;

  • Old data sources disappear and new ones arrive;

  • The originals of the program copies, i.e., the "genericity," can be mended, tuned, improved, optimized and even redesigned and remade.

If the copies remained unchanged, the situation would not be so bad: any change to the original could be re-copied and re-pasted, i.e., propagated, to the existing copies in a controlled manner by applying "search-replace-adjust-verify." However, the usual practice shows again and again that the copies themselves are also mended, tuned, improved, optimized and extended individually, for any of a thousand reasons.
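The following sketch (hypothetical names again) illustrates both the controlled propagation and its breakdown: the first copy was derived from the original by "search-replace-adjust-verify" and still carries the generic pattern recognizably; the second has since been patched individually, so the next change to the original can no longer be propagated mechanically.

    # Hypothetical sketch: an original ETL program and two of its copies.

    # The original: the carrier of the generic load pattern.
    def load_product(conn):
        conn.execute(
            "INSERT INTO dwh_product SELECT * FROM src_product "
            "WHERE load_date = CURRENT_DATE"
        )

    # Copy 1: produced by search-replace-adjust-verify. The generic
    # pattern is still recognizable, so a change to the original can be
    # propagated here in a controlled manner.
    def load_supplier(conn):
        conn.execute(
            "INSERT INTO dwh_supplier SELECT * FROM src_supplier "
            "WHERE load_date = CURRENT_DATE"
        )

    # Copy 2: individually "improved" later (a renamed column, an extra
    # filter). A mechanical search-replace of the original's next change
    # will now miss it or break it; the original/copy relationship is lost.
    def load_shipment(conn):
        conn.execute(
            "INSERT INTO dwh_shipment SELECT * FROM src_shipment "
            "WHERE loading_dt = CURRENT_DATE AND status <> 'CANCELLED'"
        )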

At the very beginning, it could be clear what is common and generic and what is individual and specific. We could thus work efficiently in an industrial manner with the powerful operations chain analyzed in Data Warehouse Construction: Behavior Patterns Analysis. At that time, the relationship between the original and its copies was strong and unambiguous. At the end of the day, however, everything is mixed. No one knows that there was ever an "original/copy" relationship between them. Now all of them stand alone, like orphans. Thus, all later changes have to be made individually and directly to the copies.

More than Programming   

In fact, the above observations are also valid for the design process and the design specifications delivered by different designers in a large design team over a long period. As a matter of fact, the editing behavior patterns discussed previously in Data Warehouse Construction: Behavior Patterns Analysis are not limited to programs and programming or to specifications and specifying. Their extensive occurrence can also be observed in documentation and documenting. Furthermore, the documentation, originally produced efficiently by applying the well-mastered editing operations chain, should be updated as soon as any change to the programs is made during their development or maintenance. As described and argued above, this becomes more and more challenging as time passes. One additional reason is that seamless updates of program documentation are not an absolute and compelling necessity for the proper functioning of programs in production, although they are vital for maintenance and extension. Obviously, without considerable effort, all programs will eventually become undocumented, at least psychologically.

Suddenly, we find ourselves in the flea market of Amsterdam: a huge heap of individually handmade and undocumented artisan crafts, whose maintenance and extension demand an unreasonable amount of time and money. This is the reality in which data warehouse constructors live today. Last but not least, we have to point out that all this is independent of whether or not the (ETL) tools or aids available today are employed.

As a matter of fact, the major issue here is the inappropriate treatment of the carriers of the generic knowledge existing in the ETL mechanism. For the reasons discussed above, and more, their forms are changed so much that the generic knowledge they carry can no longer be perceived and exploited as such. In this way, generic knowledge and specific knowledge become so mixed that the former has to be regarded and treated as specific knowledge too. Treating individual and specific things is almost always, and frequently substantially, more expensive and troublesome. This is the real cause of the tragedy.

Summary

  • The ETL mechanism of an average enterprise data warehouse usually consists of a large number of fairly similar but not identical programs.

  • There exists a considerable amount of domain generic knowledge in the ETL mechanism because of this significant similarity.

  • With today’s popular constructional approaches, regardless of whether tools or aids are employed, this generic knowledge is distributed unconsciously and in an uncontrolled manner into every program.

  • Due to the remarkable quantity of the ETL programs and the long-lived and dynamic nature of data warehouses, a considerable amount of changes to these programs is inevitable.

  • These changes lead eventually to an inseparable mixture of the carriers of generic knowledge with those of specific knowledge. This, in turn, makes everything appear to be specific knowledge.

  • Specific knowledge has to be treated specifically and thus individually. Such treatments are expensive, time-consuming and risky, just as we have been experiencing for decades.

  • To avoid producing carriers of pseudo-specific knowledge, generic knowledge should never be distributed (B. Jiang, 2011). In governance terms, only a single version of the interpretation of the generic knowledge is allowed. This is the key idea of a completely new and effective approach to data warehouse construction; a rough sketch of the idea follows below.

How this approach would look and how effective it would be are topics that I will treat in my upcoming articles.
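By way of illustration only (the concrete approach is the subject of the upcoming articles and of the book cited above), one way to keep a single version of the generic knowledge is to hold the load pattern in exactly one generic program and move the per-table specifics into metadata. The mapping list and names below are hypothetical.

    # Hypothetical sketch of the "single interpretation" idea: the
    # generic load pattern exists exactly once; the specific knowledge
    # lives in metadata instead of in thousands of program copies.

    MAPPINGS = [
        # (source table, target table, columns) -- the specific knowledge
        ("src_customer", "dwh_customer", ["cust_id", "name", "country"]),
        ("src_order",    "dwh_order",    ["order_id", "cust_id", "amount"]),
    ]

    def generic_load(conn, source, target, columns):
        """The single carrier of the generic knowledge: one load pattern."""
        cols = ", ".join(columns)
        conn.execute(f"INSERT INTO {target} ({cols}) SELECT {cols} FROM {source}")

    def run_all(conn):
        # A change to the generic pattern is made once, here, and takes
        # effect for every mapping; there are no copies to hunt down.
        for source, target, columns in MAPPINGS:
            generic_load(conn, source, target, columns)
        conn.commit()

Whatever form the metadata actually takes, the governance effect is the same: the generic knowledge has exactly one interpretation, so it can never drift into pseudo-specificity.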

  • Bin Jiang, Ph.D.
    Dr. Bin Jiang received his master’s degree in Computer Science from the University of Dortmund / Germany in 1986. In 1992, he received his doctorate in Computer Science from ETH Zurich / Switzerland. During the research period, two of his publications in the field of database management systems were awarded as the best student papers at the IEEE Conference on Data Engineering in 1990 and 1992.

    Afterward, he worked for several major Swiss banks, insurance companies, retailers, and with one of the largest international data warehousing consulting firms as a system engineer, software developer, and application analyst in the early years, and then as a senior data warehouse consultant and architect for almost twenty years.

    Dr. Bin Jiang is a Distinguished Professor at a large university in China and the author of the book Constructing Data Warehouses with Metadata-driven Generic Operators (DBJ Publishing, July 2011), which Dr. Claudia Imhoff called “a significant feat” and for which Bill Inmon provided a remarkable foreword. Dr. Jiang can be reached by email at bin.jiang@bluewin.ch.

    Editor's Note: You can find more articles from Dr. Bin Jiang and a link to his blog in his BeyeNETWORK expert channel, Data Warehouse Realization.
