<-- Back to full color view

Defining Data Warehouse Variants by Classification

Originally published June 25, 2012

In my last article, I defined a data warehouse as an infrastructure based on the information technology for an organization to integrate, collect, and prepare data on a regular basis for easing analysis. This is a general definition. In our engineering practice, there are diverse data warehouse variants. In this article, I propose a classification system as a tool for easing our future communication and work orientation in the data warehousing field.
Using a well-designed classification system to define related terms is advantageous. The semantics implied in the relationships among these terms, not only static and positional ones but also dynamic and evolutional ones, can be exploited to make the definitions less ambiguous, more understandable and easier to remember. Such relationships form the “context” that provides semantics for the terms to be defined. A simple application of this definitional technique is using a complementary pair of terms, which we will use extensively in the following discussion.

In my opinion, there can be four orthogonal classifications of data warehouses in our practice. The first is according to the topological characters, the second is in accordance with the organizational attributes, the third considers the temporal aspects and the fourth observes things from the geographical perspective.

Topological / Back-End Classification

This classification takes the back-end constellation of the data warehouses into consideration, i.e., the data source side. There are two classes:

Single-source data warehouses. A data warehouse belongs to this class if it has only one source application. This class contains the simplest data warehouses of all. Because there is only one operational application delivering data to the data warehouse, nothing has to be done for the issue of data integration, which is in general a challenging task when constructing data warehouses. On the other hand, the data has to be transformed more or less to meet the representation requirements set by the analysis needs.

There are several special cases encountered quite frequently in data engineering. If the system neither transforms nor collects the data, (i.e., the data it keeps is only a replica of that from the source application), this system is not a data warehouse, even if the data is used for analysis. If the system transforms but does not collect the data at all, (i.e., absolutely no history of the operational data from the source application is maintained there), the system is not a data warehouse. If the system collects but does not transform the data, (i.e., a history of the operational data is maintained online for analysis purposes), the system is not a data warehouse; it is an operational data store.

Today, many standard operational applications have their own data warehouse extensions for subsequent analysis of the executed operations. All of them can be considered as single-source data warehouses.

Multi-source data warehouses. A data warehouse belongs to this class if it is not a single-source warehouse. When talking about data warehouses, we mean mostly multi-source warehouses since one of the most important functionalities a data warehouse provides is data integration. From an evolutionary perspective, the single-source warehouse represents the primary phase of the data warehousing initiative of the organization. If the job is well done, the mature and final stadium is a multi-source data warehouse. In other words, the evolution is unidirectional, from single-source to multi-source.

Organizational / Front-End Classification

This classification is based on the front-end constellation of the data warehouses, i.e., the end user side. There are two classes:

Departmental data warehouses. If the data warehouse is utilized mainly by a part of the organization, it is a departmental data warehouse. In data warehousing literature, it is also called a data mart. Within a big organization, there can be many independent departmental data warehouses. They are used, owned and supervised independently by the respective “departments.”  One of the major issues with maintaining multiple departmental data warehouses within the organization is the inconsistency of the data from different data warehouses. The other is the high total cost of ownership of all these data warehouses.

Enterprise data warehouses. If the data warehouse is employed by the whole organization for analysis purposes, it is an enterprise data warehouse. The enterprise data warehouse represents the target stadium of the data warehouse evolution within the organization. The following stages are typical along the unidirectional evolution path within an organization:
  1. A lot of departmental data warehouses exist to meet the departmental analysis needs.

  2. The enterprise data warehouse appears and coexists with the departmental data warehouses to solve the “single version of the truth” and “total cost of ownership” issues previously mentioned.

  3. The enterprise data warehouse exists alone within the organization.
Since the enterprise data warehouse is of strategic importance for the organization, it must meet a lot of fundamental requirements, which will be discussed in my next article.

Temporal / Freshness Classification

This classification takes the state of the data warehouse content into account, i.e., their change modes. There are two classes:

Periodical data warehouses. If updating the data warehouse content is carried out periodically – for example, daily or weekly – it is a periodical data warehouse. The updating is independent of the source data generation and the period is defined beforehand in accordance with the business needs. Data warehouses, whose updating is triggered by the system administrator, belong to this class as well. In principle, this can also be called demand-driven updating.

Real-time data warehouses. If the content of the data warehouse is updated very shortly after the source data is generated by the operational applications  for instance, if it is updated only a few minutes after the source data generation or only after the generation of a small, defined number of data rows it is a real-time, or more accurately, a near real-time data warehouse. The content updates of these data warehouses are driven by the source data generation; in other words, generation-driven updating. The data freshness of data warehouses in this class is crucial for the near real-time business enabling and the so-called tactical decisions.

Most often, a data warehouse begins its life with demand-driven updating. Success stories and effectiveness attract more users, who in turn generate more demands and set higher data freshness requirements. This eventually leads to a complete real-time data warehouse. During the evolution, there may be a long time with a mixture of updating modes, partially periodically and partially real-time.

Geographical / Location Classification

This classification considers the geographical location modes of the data warehouses. There are two classes:

Distributed data warehouses. If the major data objects of the data warehouse – or parts of them – are logically based on the same model but stored and processed physically at different geographical locations, it is a distributed data warehouse. If the underlying data models are different and independent of each other, the data warehouses in consideration are departmental. This class of data warehouses is especially meaningful for big international organizations.

Centralized data warehouses. If all data objects of the data warehouse are processed and stored physically at one geographical location, it is a centralized data warehouse.

Different from the previous three classes, in which the evolutions are unidirectional, the evolution here is bidirectional, completely dependent of the business development and requirements.

Figure 1 is a summary of the classifications.

Figure 1: A Classification System of Data Warehouse Variants (B. Jiang, 2011)

The four classification criterions applied in the system are orthogonal to each other. Therefore, we can use a combination of the four initials of the characteristic words for each of the classes to typify a data warehouse. For instance, an MEPC data warehouse means a Multi-source, Enterprise, Periodic, and Centralized data warehouse. This way, we get 16 (=2x2x2x2) different types of data warehouses. We call this classification and typifying system the DWTI-system (Data Warehouse Type Indicator).

SOURCE: Defining Data Warehouse Variants by Classification

  • Bin Jiang, Ph.D.Bin Jiang, Ph.D.
    Dr. Bin Jiang received his master’s degree in Computer Science from the University of Dortmund / Germany in 1986. In 1992, he received his doctorate in Computer Science from ETH Zurich / Switzerland. During the research period, two of his publications in the field of database management systems were awarded as the best student papers at the IEEE Conference on Data Engineering in 1990 and 1992.

    Afterward, he worked for several major Swiss banks, insurance companies, retailers, and with one of the largest international data warehousing consulting firms as a system engineer, software developer, and application analyst in the early years, and then as a senior data warehouse consultant and architect for almost twenty years.

    Dr. Bin Jiang is a Distinguished Professor of a large university in China, and the author of the book Constructing Data Warehouses with Metadata-driven Generic Operators (DBJ Publishing, July 2011), which Dr. Claudia Imhoff called “a significant feat” and for which Bill Inmon provided a remarkable foreword. Dr. Jiang can be reached by email at bin.jiang@bluewin.ch

    Editor's Note: You can find more articles from Dr. Bin Jiang and a link to his blog in his BeyeNETWORK expert channel, Data Warehouse Realization.

Recent articles by Bin Jiang, Ph.D.



Want to post a comment? Login or become a member today!

Be the first to comment!


Copyright 2004 — 2020. Powell Media, LLC. All rights reserved.
BeyeNETWORK™ is a trademark of Powell Media, LLC