
Data Warehouse Construction: Generator, Generic Knowledge and Operative Metadata

Originally published December 6, 2012

In my article Data Warehouse Construction: Behavior Pattern Analysis, we observed the usual constructional behaviors with a lab task: emptying all 1000 tables in a database "my_db." There, we showed that the approach employed was not effective and analyzed the reasons why. In my article Data Warehouse Construction: The Real Life, we showed that the insights obtained there are relevant to real-life data warehouse construction today. In this article, we go back to the laboratory to consider some fundamentals of another type of constructional approach: the metadata-driven generic one.

Generator: Another Solution for the Lab Task

Assume that you know some basics of SQL (structured query language). Then, you should know what the following SQL statement does if it is executed in an SQL execution environment:
SELECT   'DELETE FROM my_db.' || tablename || ';'
FROM    my_catalog.tables
WHERE   databasename = 'my_db';
When executed, this statement generates, in a fraction of a second, all the DELETE statements listed in the lab example discussed in Data Warehouse Construction: Behavior Pattern Analysis, namely:
DELETE FROM my_db.tab_adfhkfha;
DELETE FROM my_db.tab_lkeasr;
...
DELETE FROM my_db.tab_dajpgh;
Here is a little technical detail for a better understanding of the story.
  • When executed, the above SQL statement looks up the table "tables" located in the catalog database "my_catalog," fetches all records in it describing the tables hosted in the database "my_db," and collects the names of these tables (tablename). For each of these table names, it constructs a character string by concatenating three portions: the string "DELETE FROM my_db.", the current value of the column "tablename" taken from the 1000 table names just collected, and the character ";" that syntactically terminates every SQL statement.

  • Before we can do this, we need to put information about the related tables into the table catalog, i.e., the table "tables" located in the catalog database "my_catalog." Each of these tables has a corresponding record in this catalog table with at least two columns, "databasename" and "tablename," denoting the hosting database and the table in consideration, respectively (see the sketch after this list).
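
For illustration, such a user-defined catalog table and its content could be sketched as follows; the table and column names are taken from the example above, while the column lengths are assumptions:
CREATE TABLE my_catalog.tables (
    databasename VARCHAR(128) NOT NULL,
    tablename    VARCHAR(128) NOT NULL
);
INSERT INTO my_catalog.tables VALUES ('my_db', 'tab_adfhkfha');
INSERT INTO my_catalog.tables VALUES ('my_db', 'tab_lkeasr');
-- ... one record for each of the 1000 tables hosted in "my_db" ...
INSERT INTO my_catalog.tables VALUES ('my_db', 'tab_dajpgh');
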
It is easy to verify that each such character string corresponds to one of the desired DELETE statements. In this sense, the SQL statement in consideration is a statement generator, albeit a simple one. What you still need to do is simply execute these generated statements (a sketch of doing so in a single pass follows below). Note that there is no cold-blooded "copy," no powerful "paste," no clever "search," no elegant "replace," no noxious "adjust" and no skeptical "verify" anymore. In short, there is no operations repetition (discussed in Data Warehouse Construction: Behavior Pattern Analysis) at all!
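
If the SQL execution environment at hand supports dynamic SQL, even this final execution step can be folded in, so that generation and execution happen in one pass. The following is a minimal sketch in PostgreSQL's PL/pgSQL syntax, assuming the user-defined catalog table "my_catalog.tables" sketched above; other database management systems offer analogous stored-procedure facilities:
DO $$
DECLARE
    stmt TEXT;
BEGIN
    FOR stmt IN
        SELECT 'DELETE FROM my_db.' || tablename   -- no trailing ';' is needed when executing directly
        FROM   my_catalog.tables
        WHERE  databasename = 'my_db'
    LOOP
        EXECUTE stmt;   -- run each generated DELETE statement in turn
    END LOOP;
END $$;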

Now, let us have a closer look at the components of this statement generator, annotated in the sketch below. The first portion, 'DELETE FROM my_db.', and the third one, ';', of the complete character string to be constructed are constant. If the traditional approach is applied, they are the repeated contents; in particular, they are the major motivation for applying the operations pair "copy & paste" 999 times. In the terms introduced in the article mentioned above, they are carriers of certain domain-generic knowledge, regarding the SQL syntax and the programming style conventions in our case. The second portion is variable. Its concrete value, i.e., the table name, is specific to each individual table. With the traditional approach, this portion is where the operations chain "search-replace-adjust-verify" is applied, again 999 times. In the terms introduced in the article mentioned above, this portion is a carrier of certain object-specific knowledge, i.e., the table name in our case.
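
For reference, here is the same generator once more with its three portions annotated; the comments are the only addition:
SELECT  'DELETE FROM my_db.'   -- first portion: constant, a carrier of domain-generic knowledge
        || tablename           -- second portion: variable, a carrier of object-specific knowledge
        || ';'                 -- third portion: constant, a carrier of domain-generic knowledge
FROM    my_catalog.tables
WHERE   databasename = 'my_db';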

What are the essential differences between the two approaches regarding constructional behaviors?

Domain-Generic Knowledge

With the new approach, the carriers of domain-generic knowledge are touched/edited only once, instead of the additional 999 manual "copy & paste" operations of the traditional approach. In other words, with the new approach the domain-generic knowledge is centralized rather than distributed. Although the 1000 generated DELETE statements contain these carriers, they are in principle never read, understood or changed, just like a piece of binary code. This way, the carriers of generic knowledge remain exactly that, and no noxious "adjust" can change this state.

In Data Warehouse Construction: Behavior Pattern Analysis, we introduced domain-generic knowledge in a very thrifty form and used it to analyze the behaviors in data warehouse construction. In general, by generic knowledge we mean knowledge that has general validity and applicability in a given domain of interest; therefore, we call it domain-generic knowledge. A domain in our context is a well-defined system area such as a source application, the data warehouse, and so forth. Representatives of domain-generic knowledge are the data warehouse architecture guidelines, the programming style conventions and, in particular, the algorithms employed. In fact, the latter generally represent the major portion of the domain-generic knowledge involved in data warehouse construction. As an important matter of fact, the architect team produces nothing but domain-generic knowledge.

Object-Specific Knowledge

With the new approach, we know exactly where the object-specific knowledge is to be placed. Thus, there is no need for the operations pair "search & replace." Moreover, we know exactly which object-specific knowledge is to be used, and we always get the correct one. Hence, there is no need for the operations pair "adjust & verify." In short, there is no need for the expensive and time-consuming operations repetition analyzed in Data Warehouse Construction: Behavior Pattern Analysis. Therefore, the new approach is substantially more effective than the traditional one.

But how do we obtain such object-specific knowledge? The second point of the above explanation may have left the impression that we manually put the information about the related tables into the catalog table. If that were the case, our life as data warehouse constructors would not be improved substantially by the new approach, because manual acquisition is generally just as error-prone, and we would have to "adjust & verify" its results individually.

Actually, we obtain this information perfectly for free! When a table is created in a given host database such as "my_db" using any database management system available today, information such as the table name and the name of the database hosting this table is stored by the system automatically as a record in the so-called system catalog table "tables" or the like. If the table is dropped later, this record is removed from the system catalog table automatically and completely. In short, a table exists in a database if and only if there is a corresponding record in the system catalog table "tables." This information exists in the system catalog without our having to do anything, and many people are indeed not aware of its existence. To make the story perfect, therefore, the above statement generator should look like
SELECT   'DELETE FROM my_db.' || tablename || ';'
FROM     system_catalog.tables
WHERE   databasename = 'my_db';
where "my_catalog" is replaced by "system_catalog."

Actually, the so-called "information" mentioned above, regardless of whether it is stored in "my_catalog" or in "system_catalog," is nothing but our object-specific knowledge. As a matter of fact, the object-specific knowledge, in turn, is nothing but the so-called operative metadata.

Operative Metadata

Operative metadata defines the operative/system objects and the relationships among them in a system, and thereby determines the system's behavior or state. It is thus indispensable for the proper functioning of the system in consideration. Examples of operative metadata are the column list of a table, or the column mappings from a source table to a target table. The former is usually stored in the system catalog (e.g., "system_catalog" above) and maintained by the system automatically, whereas the latter is stored in a user/tool-defined catalog (e.g., "my_catalog" above) and maintained by the system constructors manually. For business users, operative metadata is data stored somewhere in the systems: data they do not always understand and are, therefore, not interested in.
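
As an illustration of the second example, a user/tool-defined catalog table holding such column mappings could be sketched as follows; all names and column lengths are illustrative assumptions, with one record per source-to-target column assignment:
CREATE TABLE my_catalog.column_mappings (
    source_table    VARCHAR(128) NOT NULL,
    source_column   VARCHAR(128) NOT NULL,
    target_table    VARCHAR(128) NOT NULL,
    target_column   VARCHAR(128) NOT NULL,
    transformation  VARCHAR(4000)          -- optional expression applied while loading
);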

It is noteworthy that operative metadata is object-specific by definition. As a matter of fact, the activities in constructing data warehouses that are not repetitive are related to operative metadata: it has to be treated individually and specifically for each concrete object, e.g., each table or mapping. This is why we said previously that the object-specific knowledge is nothing but the operative metadata.

In the terms explained above, the new type of constructional approach mentioned at the beginning of this article, i.e., the metadata-driven generic one, could now be called an operative-metadata-driven, domain-generic-knowledge-centralized approach type. This designation is a little long but complete.

In this article, we actually observed only one of the possibilities of this approach type. In the upcoming articles, we will investigate another, even more effective possibility. In fact, this approach type is nothing but a new constructional paradigm, in the sense of Thomas Samuel Kuhn (1922 – 1996), as will be discussed in my next article.


  • Bin Jiang, Ph.D.
    Dr. Bin Jiang received his master's degree in Computer Science from the University of Dortmund / Germany in 1986. In 1992, he received his doctorate in Computer Science from ETH Zurich / Switzerland. During his research period, two of his publications in the field of database management systems were recognized as best student papers at the IEEE Conference on Data Engineering in 1990 and 1992.

    Afterward, he worked for several major Swiss banks, insurance companies, retailers, and with one of the largest international data warehousing consulting firms as a system engineer, software developer, and application analyst in the early years, and then as a senior data warehouse consultant and architect for almost twenty years.

    Dr. Bin Jiang is a Distinguished Professor of a large university in China, and the author of the book Constructing Data Warehouses with Metadata-driven Generic Operators (DBJ Publishing, July 2011), which Dr. Claudia Imhoff called “a significant feat” and for which Bill Inmon provided a remarkable foreword. Dr. Jiang can be reached by email at bin.jiang@bluewin.ch

    Editor's Note: You can find more articles from Dr. Bin Jiang and a link to his blog in his BeyeNETWORK expert channel, Data Warehouse Realization.
