<-- Back to full color view
Data Warehouse Construction: Generator or Operator?
Originally published March 7, 2013
At the end of my article Data Warehouse Construction: Compiler, Interpreter and Operator, I pointed out that, from the interpreting mechanism standpoint, there is no essential difference between compilers and interpreters or generators and operators. From the software administration and operating perspective, however, the difference is substantial. In this article, we will have close look at it.
As a matter of fact, a metadata-driven generic operator (MGO), discussed in the article mentioned above, is nothing but a corresponding generator plus an executor. Fed with the given operative metadata, the generator generates scripts composed of SQL statements, and the executor executes these scripts immediately afterward. Here, no scripts generated are required to be kept. If we settle a switch onto the operator to dictate it to keep the scripts generated and not execute them, we transform the operator to a generator. We call such a switch the operating mode switch with two modes: "compiler" and "interpreter." This way, the same MGO can be applied in two different fashions.
This switch is quite practical. For an MGO in the normal operating situation, no generated scripts are kept after their execution. In this case, the MGO runs actually in the interpreter mode. If the scripts should be generated without execution, for instance, for failure analysis, we can let the MGO run in the compiler mode. In the following, we compare both modes in the context of software administration in order to figure out the most effective approach for data warehouse software administration in practice.
For the sake of simplicity and in order to conduct a comprehensible discussion, we make the following restrictive assumptions:
- A complex MGO is employed for the discussion. It can be switched between the two operating modes as defined above, depending on the observation needs.
- We want to update several hundred target tables with data extracted from several hundred source tables. Each pair of target/source tables for the updating corresponds to a generated script composed of several SQL statements with the compiler approach, whereas to an equivalent program invocation with the interpreter approach. There should be, hence, about 1,000 complex SQL scripts or 1,000 equivalent program invocations needed for a realistic enterprise data warehouse update.
It should be noticed that there is no difference between the two approaches with respect to the metadata acquisition.
The major property of this approach is that about 1,000 complex SQL scripts
will be generated and retained. Regarding these scripts, it must always be ensured that the following tasks are accomplished perfectly:
- We must always know by which version of the generator (i.e., the domain-generic knowledge) and from which version of the metadata (i.e., the object-specific knowledge), the scripts were generated respectively. Otherwise, we do not know what to do and what is wrong if an error in the scripts is detected.
- We must always know which scripts are to be replaced if:
- The generator has been modified or
- The metadata has been changed.
Otherwise, all scripts have to be replaced if anything has been changed, no matter whether it is the generator or metadata.
- We must always know:
- Where the scripts are stored and
- How they are named, possibly together with versioning information.
- We must ensure that the script invocations always have the correct reference in the schedule of the process management mechanism.
In short, we have to deal with:
- Two dimensions of versioning (generator and metadata) and
- Two locations of information (invocations and invocated scripts).
This task is by no means straightforward, especially when the number of scripts is considerable. On the other hand, the compiler approach provides certain flexibility
. This is because the scripts leave the control of the generator once they are distributed. This advantage shows particularly in the case of failure. That is, we can repair the defect script locally or manipulate it without affecting others.
The main characteristic of this approach is that no SQL scripts
will be generated at all. This has some convenient implications:
- There is only one dimension of versioning, i.e., that of metadata. A versioning of the operator in the production system is no longer necessary since the operator is always at its best and most current version there.
- There is only one location of information, i.e., that of invocations. It is no longer necessary to ensure the synchronization between the invocations and the invocated scripts.
Moreover, the approach still has the following additional properties in our case:
- The SQL statements that are composed and to be executed are stable. That is, they do not depend on specific data constellations or operational circumstances. As long as the related metadata and the operator are not changed, the SQL statements remain the same, no matter when, where, why and how they were generated – in the daily production, or during the failure analysis by switching the MGO in the compiler mode. Therefore, it is not necessary to store these SQL statements or scripts physically for later failure analysis. If you want to see what the executed scripts looked like, invoke the MGO in the compiler mode and you get the scripts exactly. This property makes the system more compact.
- Erroneous SQL statements, which were detected by failure analysis, can be corrected just as with the compiler approach. For this case, the corresponding operator invocation is replaced transiently with the invocation of the generated and corrected script until the next metadata and/or operator release. This way, the interpreter approach obtains a similar flexibility as well as with the compiler approach.
- Generating SQL statements certainly takes some time. However, this time portion is absolutely negligible in comparison with the time portion needed for the data processing in the data warehouse context.
To ensure that the operator is always at its best and most current version and works perfectly with regard to all invocations, it must be verified and tested for any change to it against these invocations. As a matter of fact, such verification and test are generally far from being as complicated and extensive as they might appear. The following procedure and mechanism can be utilized for an efficient correctness assurance:
- Run the old MGO in the compiler mode against all invocations;
- Store all generated SQL statements;
- Run the new MGO in the compiler mode against all invocations;
- Store all generated SQL statements;
- Compare both groups of SQL statements, and identify the difference;
- Test only the difference for the new MGO.
- Operators principally always keep the generic knowledge centralized. Thus, it is very easy to administrate the entire extract-transform-load (ETL) mechanism, i.e., the operative metadata, plus a very small number of MGOs. If well made, the MGOs themselves are generally quite small as well, as detailed in Constructing Data Warehouses with Metadata-Driven Generic Operators, and More (B. Jiang, DBJ Publishing, 2011).
- Generators distribute the generic knowledge as soon as the generated scripts are distributed. Therefore, it is quite challenging to administrate the whole ETL mechanism if the number of the scripts is considerable. If we want to take advantage of the so-called flexibility and permit modification of the generated scripts, which is unfortunately quite often the usual practice, we will find ourselves soon in the scenarios depicted in my articles Data Warehouse Construction: Behavior Pattern Analysis and Data Warehouse Construction: The Real Life. In such cases, the generators are often employed only as disposable aids for the first days of the data warehousing initiative. These days are mostly bright and merry. The rest of the life, however, will slowly become depressing and gloomy …
Therefore, we propose to use MGOs in the following ways:
- For operating, use them in the interpreter mode consistently. This is the major case.
- Only for failure analysis and for verifying acquitted operative metadata, switch them into the compiler mode.
SOURCE: Data Warehouse Construction: Generator or Operator?
Recent articles by Bin Jiang, Ph.D.