A refrigerator is an infrastructure based on thermal technology that a household uses to cool meat, vegetables, and the like. Similarly, a data warehouse is an infrastructure based on information technology for an organization to integrate, collect, and prepare data on a regular basis for easing analysis. Now, let us take a closer look at the terms used here.
An infrastructure, such as a refrigerator or a highway, provides certain functionalities for its users. Our infrastructure is based on information technology. This means that it is generally composed of hardware (CPUs, memory, storage, networks, etc.), base software (operating systems, database management systems, tools), and special software (programs for data extraction, transformation, loading, etc.). This also means that it is not based on anything else like bones, papyrus, silks, paper receipts, bricks, etc.
The user of this infrastructure is the organization, such as a company, government, association, or nonprofit organization. The organization employs the functionalities that this infrastructure provides in order to ease its data analysis or to make it effective. Here are two noteworthy points:
- The infrastructure should provide certain functionalities to ease data analysis that are not provided by other infrastructures within the organization. In other words, the functionalities provided by the infrastructure should be unique.
- The effectiveness aspect here is essential. If data analysis weren’t required to be quick, smooth, accurate, correct, etc., no additional infrastructure would be necessary. Similarly, if driving and transportation weren’t required to be fast, safe, and energy efficient, no highways would have to be constructed. We will come back to effectiveness and discuss it in more detail later.
Generally, infrastructures themselves do not add new “things” to the existing things – no new information, no new meat or vegetable, no new cars. They only facilitate or ease our life. This way, they add new “value” to the existing things. The following are unique functionalities a data warehouse can provide to the organization:
- Data integration. Almost every business today is enabled by a set of IT-based operational applications. These operational applications form the business enabler of the organization. They are generally used, owned, and supervised by different departments and have different emergence and development histories. They are usually based on different technologies and distributed in different locations. In short, they are scattered in diverse spaces within the organization – the subject space, the political space, the technological space, the evolutional space, and the geographical space. Moreover, each of these operational applications covers only a small part of the organization’s business enablement rather than all of it. Most of these operational applications produce data as a byproduct, recording the business state. It is this byproduct, occasionally complemented by some external data, that is the data source or the raw material of the data warehouse. The first and most important functionality required from a data warehouse is putting such widely scattered data together in a unified form – structurally and semantically – to make it effectively comprehensible. In this sense, it is considered a spatial functionality of the data warehouse. Apart from the data warehouse, there should not be any other type of infrastructure within the organization that provides this functionality to such an extent as its main purpose. This is a prerequisite for effectively solving the “single version of the truth” problem.
- Data collection. This is a temporal functionality of the data warehouse. At any time, the data produced by an operational application represents a part of the state of the business run by the organization; all such parts form a complete snapshot of the entire business. The collection of all such business snapshots, ordered along the time axis, i.e., a series of temporally chained business snapshots, represents the history of the business. Although some operational applications keep the data themselves for some time due to business needs like claim treatments, this data is not available online forever. If the enterprise IT architecture is well designed, the data warehouse should be the only infrastructure within the organization that keeps the collected snapshots available online for the whole organization as long as the business requires. This is a prerequisite for effective “predictive modeling”: from the past into the future.
If the data warehouse only contains “correct” snapshots (the erroneous ones will be replaced by the correct ones as soon as they are detected) representing the historical business state, the data warehouse is unitemporal. If the erroneous ones must not be removed and are retained in the data warehouse together with the corresponding correct ones, the data warehouse is bitemporal. Inmon describes this as nonvolatile. Bitemporal data warehouses contain not only business state snapshots, but also snapshots representing the past data state of the data warehouses themselves. Note that this is only an intuitive application of the bitemporality. A precise definition of this term is given in my book, Constructing Data Warehouses. Tritemporality is also discussed there.
- Data preparation. The previous two functionalities are objective, physical, measurable, and thus quantitative. For instance, my data warehouse is sourced from 35 operational applications and keeps a 20-year history of their data. Conversely, the functionality of data preparation is a subjective and qualitative one, and thus cannot be measured in the same way. Each source application has its own operating environment, is based on its own technology, and has to meet its own business and application requirements. As a consequence, the data each one produces has its own individual look and characteristics. In general, these do not always meet the requirements for analysis purposes. For instance, you want to see the data in a denormalized form, but it is in a normalized form. You want to know the state of the business one year ago, but you first have to separate all irrelevant data from the data you need. Thus, to ease the subsequent analysis, the data has to be prepared so as to meet the analysis requirements. The quality of the preparation can be “judged” by its effectiveness, another qualitative and subjective term. For example, if prior to a trip you researched traffic on every road and could arrive at every reserved hotel on time with everything you needed on hand, then your preparation was effective. Otherwise, the preparation was ineffective.
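The difference between a unitemporal and a bitemporal data warehouse, as described intuitively above, can be pictured with a small Python sketch. It is only an illustration under stated assumptions, not the precise definition referred to in the text: the names `Snapshot`, `UnitemporalStore`, `BitemporalStore`, and the `value` measure are all hypothetical, and a real data warehouse would keep such data in database tables rather than in memory.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    business_date: date  # the day whose business state this snapshot describes
    recorded_on: date    # the day the warehouse stored this version
    value: int           # hypothetical measure, e.g. number of open claims


class UnitemporalStore:
    """Keeps one version per business date; a correction replaces the error."""

    def __init__(self) -> None:
        self._rows = {}

    def load(self, snap: Snapshot) -> None:
        # Overwrites any earlier (possibly erroneous) version for that date.
        self._rows[snap.business_date] = snap

    def state_on(self, business_date: date) -> Optional[int]:
        snap = self._rows.get(business_date)
        return snap.value if snap else None


class BitemporalStore:
    """Keeps every version; corrections are appended, never overwritten."""

    def __init__(self) -> None:
        self._rows = []

    def load(self, snap: Snapshot) -> None:
        self._rows.append(snap)

    def state_on(self, business_date: date,
                 known_as_of: Optional[date] = None) -> Optional[int]:
        # Consider only versions the warehouse already held on `known_as_of`.
        candidates = [
            s for s in self._rows
            if s.business_date == business_date
            and (known_as_of is None or s.recorded_on <= known_as_of)
        ]
        if not candidates:
            return None
        # The most recently recorded version wins.
        return max(candidates, key=lambda s: s.recorded_on).value


# An erroneous snapshot and its later correction for the same business date.
wrong = Snapshot(date(2011, 1, 31), date(2011, 2, 1), 120)
fixed = Snapshot(date(2011, 1, 31), date(2011, 2, 10), 125)

uni, bi = UnitemporalStore(), BitemporalStore()
for snap in (wrong, fixed):
    uni.load(snap)
    bi.load(snap)
```

With this data, both stores report 125 for 31 January 2011. Only the bitemporal store can still answer what the warehouse itself believed on 5 February 2011, before the correction arrived: `bi.state_on(date(2011, 1, 31), known_as_of=date(2011, 2, 5))` returns 120.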
As a matter of fact, the term “effectiveness” is the key to the whole story. “For easing analysis” simply means “for making analysis effective.” During an analysis, if you get every data set you need at once (space), every historical data snapshot you require immediately (time), and all this data is in the requested form and quality and delivered in time (usability), your analysis should be easy and effective since you can concentrate on the very analysis itself without any waiting or further complex and laborious preprocessing. In short, all three functionalities discussed serve the analysis effectiveness.
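As a rough illustration of the space and usability aspects, the following Python sketch brings the byproduct data of two source applications into one unified, denormalized record per customer and snapshot date. Everything in it is assumed for the sake of the example: the two extracts (`crm_rows`, `billing_rows`), their field names and formats, and the helper functions are hypothetical and stand in for whatever the real source applications deliver.

```python
from datetime import date

# Hypothetical extracts from two operational applications. Each uses its own
# field names, key format, date format, and unit.
crm_rows = [{"cust_no": "C-7", "country": "DE", "as_of": "2011-03-31"}]
billing_rows = [{"customerId": 7, "amount_cents": 129900, "day": "31.03.2011"}]


def from_crm(row: dict) -> dict:
    """Map a CRM record to the unified structure and semantics."""
    return {
        "customer_id": int(row["cust_no"].split("-")[1]),
        "snapshot_date": date.fromisoformat(row["as_of"]),
        "country": row["country"],
    }


def from_billing(row: dict) -> dict:
    """Map a billing record to the unified structure and semantics."""
    day, month, year = (int(part) for part in row["day"].split("."))
    return {
        "customer_id": row["customerId"],
        "snapshot_date": date(year, month, day),
        "revenue": row["amount_cents"] / 100,  # cents -> currency units
    }


def integrate(crm: list, billing: list) -> dict:
    """Merge both sources into one denormalized record per key."""
    merged = {}
    unified = [from_crm(r) for r in crm] + [from_billing(r) for r in billing]
    for rec in unified:
        key = (rec["customer_id"], rec["snapshot_date"])
        merged.setdefault(key, {}).update(rec)
    return merged


warehouse = integrate(crm_rows, billing_rows)
record = warehouse[(7, date(2011, 3, 31))]
```

Here `record` holds the customer's country and revenue side by side for 31 March 2011, so an analyst does not have to reconcile key formats, date formats, or units before starting the actual analysis.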
Last but not least, operating this infrastructure is not a one-off affair – the data collection runs “on a regular basis.” It may be carried out daily, weekly, or monthly. It may also run in a real-time mode: as soon as the data is produced, it is collected by the data warehouse. The mode or modes in which the data warehouse works depend on the business and analysis requirements.
In Figure 1, the functionalities discussed here are summarized graphically and considered in a three-dimensional way. Each of these functionalities corresponds to a dimension:
- Data integration => space dimension
- Data collection => time dimension
- Data preparation => usability dimension
Figure 1: The three-dimensional mission of data warehousing (B. Jiang, 2011)
Here is a noteworthy point regarding definitional techniques. Instead of using adjectives, which are mostly used in the popular data warehouse definitions enumerated in the first article of this series, only verbs are applied in the proposed definition. The reason is that adjectives are usually more subjective, literary, and thus ambiguous, whereas verbs are more objective, restrictive, concrete, and hence precise. This is especially valid for specifying functionalities of engineered objects.
Does this revised definition of data warehouses seem accurate to you? If yes, please use it in the future and do not create the 1002nd one! If not, please let me know what is missing.
In the next article of this series, I will introduce a classification of data warehouses and discuss diverse data warehouse variants.