Defining Data Warehouse Variants by Classification
Originally published June 25, 2012
In my last article, I defined a data warehouse as an infrastructure based on the information technology for an organization to integrate, collect, and prepare data on a regular basis for easing analysis. This is a general definition. In our engineering practice, there are diverse data warehouse variants. In this article, I propose a classification system as a tool for easing our future communication and work orientation in the data warehousing field.
Topological / Back-End ClassificationThis classification takes the back-end constellation of the data warehouses into consideration, i.e., the data source side. There are two classes:
Single-source data warehouses. A data warehouse belongs to this class if it has only one source application. This class contains the simplest data warehouses of all. Because there is only one operational application delivering data to the data warehouse, nothing has to be done for the issue of data integration, which is in general a challenging task when constructing data warehouses. On the other hand, the data has to be transformed more or less to meet the representation requirements set by the analysis needs.
There are several special cases encountered quite frequently in data engineering. If the system neither transforms nor collects the data, (i.e., the data it keeps is only a replica of that from the source application), this system is not a data warehouse, even if the data is used for analysis. If the system transforms but does not collect the data at all, (i.e., absolutely no history of the operational data from the source application is maintained there), the system is not a data warehouse. If the system collects but does not transform the data, (i.e., a history of the operational data is maintained online for analysis purposes), the system is not a data warehouse; it is an operational data store.
Today, many standard operational applications have their own data warehouse extensions for subsequent analysis of the executed operations. All of them can be considered as single-source data warehouses.
Multi-source data warehouses. A data warehouse belongs to this class if it is not a single-source warehouse. When talking about data warehouses, we mean mostly multi-source warehouses since one of the most important functionalities a data warehouse provides is data integration. From an evolutionary perspective, the single-source warehouse represents the primary phase of the data warehousing initiative of the organization. If the job is well done, the mature and final stadium is a multi-source data warehouse. In other words, the evolution is unidirectional, from single-source to multi-source.
Organizational / Front-End ClassificationThis classification is based on the front-end constellation of the data warehouses, i.e., the end user side. There are two classes:
Departmental data warehouses. If the data warehouse is utilized mainly by a part of the organization, it is a departmental data warehouse. In data warehousing literature, it is also called a data mart. Within a big organization, there can be many independent departmental data warehouses. They are used, owned and supervised independently by the respective “departments.” One of the major issues with maintaining multiple departmental data warehouses within the organization is the inconsistency of the data from different data warehouses. The other is the high total cost of ownership of all these data warehouses.
Enterprise data warehouses. If the data warehouse is employed by the whole organization for analysis purposes, it is an enterprise data warehouse. The enterprise data warehouse represents the target stadium of the data warehouse evolution within the organization. The following stages are typical along the unidirectional evolution path within an organization:
Temporal / Freshness ClassificationThis classification takes the state of the data warehouse content into account, i.e., their change modes. There are two classes:
Periodical data warehouses. If updating the data warehouse content is carried out periodically – for example, daily or weekly – it is a periodical data warehouse. The updating is independent of the source data generation and the period is defined beforehand in accordance with the business needs. Data warehouses, whose updating is triggered by the system administrator, belong to this class as well. In principle, this can also be called demand-driven updating.
Real-time data warehouses. If the content of the data warehouse is updated very shortly after the source data is generated by the operational applications – for instance, if it is updated only a few minutes after the source data generation or only after the generation of a small, defined number of data rows – it is a real-time, or more accurately, a near real-time data warehouse. The content updates of these data warehouses are driven by the source data generation; in other words, generation-driven updating. The data freshness of data warehouses in this class is crucial for the near real-time business enabling and the so-called tactical decisions.
Most often, a data warehouse begins its life with demand-driven updating. Success stories and effectiveness attract more users, who in turn generate more demands and set higher data freshness requirements. This eventually leads to a complete real-time data warehouse. During the evolution, there may be a long time with a mixture of updating modes, partially periodically and partially real-time.
Geographical / Location ClassificationThis classification considers the geographical location modes of the data warehouses. There are two classes:
Distributed data warehouses. If the major data objects of the data warehouse – or parts of them – are logically based on the same model but stored and processed physically at different geographical locations, it is a distributed data warehouse. If the underlying data models are different and independent of each other, the data warehouses in consideration are departmental. This class of data warehouses is especially meaningful for big international organizations.
Centralized data warehouses. If all data objects of the data warehouse are processed and stored physically at one geographical location, it is a centralized data warehouse.
Different from the previous three classes, in which the evolutions are unidirectional, the evolution here is bidirectional, completely dependent of the business development and requirements.
Figure 1 is a summary of the classifications.
Figure 1: A Classification System of Data Warehouse Variants (B. Jiang, 2011)
The four classification criterions applied in the system are orthogonal to each other. Therefore, we can use a combination of the four initials of the characteristic words for each of the classes to typify a data warehouse. For instance, an MEPC data warehouse means a Multi-source, Enterprise, Periodic, and Centralized data warehouse. This way, we get 16 (=2x2x2x2) different types of data warehouses. We call this classification and typifying system the DWTI-system (Data Warehouse Type Indicator).
Recent articles by Bin Jiang, Ph.D.
Copyright 2004 — 2019. Powell Media, LLC. All rights reserved.
BeyeNETWORK™ is a trademark of Powell Media, LLC