Is Inmon's Data Warehouse Definition Still Accurate?

Originally published May 10, 2012

 “A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management’s decision-making process.” Bill Inmon (1992)

The following analysis may appear to be too detailed. This is, however, necessary for establishing an accurate, objective and professional orientation of the data warehousing industry for the future. For this purpose, I will break down Inmon’s definition.

“A data warehouse is a … collection of data …” Is this statement template accurate? If yes, what do you think about other applications of this template, such as “a house is a collection of people,” or “a refrigerator is a collection of meat and vegetables”? If you say, “Yesterday, I bought a refrigerator,” do you mean that you bought a collection of chilled meat and vegetables? I don’t think so. But if you say, for example, “a house is a collection of laid bricks,” it would make much more sense. Anyway, a container cannot equal the collection of its contents. Mathematically, a set does not equal the elements contained in the set even if the set consists of only one element.

As a matter of fact, a house, as well as a refrigerator, is a mechanism or, more concretely, an infrastructure, that contains some elements and provides certain functionalities to the elements it contains for certain purposes.

Consider the all-purpose phrase “… in support of management’s decision-making process” in a bank or retail chain. Bank tellers and call center personnel use data warehouses today to support their decision making, but they are not considered management. Since an infrastructure for supporting decision making can have many components, the phrase “decision making” alone does not tell us very much about the concrete components involved. If we said “… in support of money-making,” it would sound much better and more attractive from a marketing perspective. We need to use words that are more concrete and restrictive for describing our purposes. In my opinion, “querying,” “reporting” and “analysis” are better choices, and “analysis” is the best of all: without analysis, querying and reporting do not make much sense, and without analysis, no good decision can be expected. It is analysis that challenges our brains, and good analysis enables a good decision. Thus, analysis is the key activity in support of decision making, and the data involved should be prepared so that the analysis can be carried out easily and smoothly.

Now, let us consider the four characteristic adjectives, i.e., “subject-oriented,” “integrated,” “time-variant” and “nonvolatile.” 

  • Subject-oriented: What does “subject” mean here? Customers, products and services? Revenues, incomes and cash flows? What is the opposite of the term “subject” here? Is it “process” or “transaction”? Is this feature special, unique, a must-criterion for data warehouses? Are there any other data systems within the organization that are subject-oriented too? (I think yes, quite a lot.) Does it mean an approach, a technique, a special architecture, or a recommendation for organizing the data stored in the data warehouse to make analysis more effective? If this is the case, such a characteristic word should not be used in a general definition. Of course, we could define something like a “subject-oriented data warehouse.” But what is the opposite? A “process-oriented data warehouse”?

  • Integrated: This is the best and most unambiguously understood and realized characteristic of data warehouses. Almost every definition reviewed in my last article contains some characteristic words more or less implied by this one. Examples are “collected,” “centralized,” “non-original,” “uploaded,” “specifically structured,” and “in a consistent format.” As a matter of fact, this feature is a must for any true data warehouse system. In other words, if a system cannot integrate data from different sources by transforming it into consistent forms and semantics, the system does not deserve the title of “data warehouse.” Moreover, from the perspective of the enterprise IT infrastructure, the (enterprise) data warehouse is the only component within the organization that provides the functionality of integrating data, and so supplies an integrated, performant and effective view of the enterprise data for the purpose of analysis. This is comparable to the refrigerator, the only infrastructure in the household that provides the functionality of cooling meat and vegetables. If you have multiple (departmental) data warehouses in your organization, the so-called independent data marts, you have multiple such data integrators as your IT infrastructure, one for each of your “departments.”

  • Time-variant: This is a difficult characteristic word. Roughly speaking, it means that the state of the data collection as a whole depends explicitly upon time. In principle, the state of the collection should not change by itself. It changes only if the source data involved changes and those changes are “uploaded” or “gathered” into the collection. How frequently the uploading happens depends on how fresh the data in the collection needs to be for analysis. There are two basic ways to understand and realize this requirement:
  1. When new data is uploaded, the old data is replaced by the new data.
  2. When new data is uploaded, the old data is not replaced by the new data but kept. That is, the new data is appended.

Because of the fourth characteristic word, “nonvolatile,” which we shall discuss soon, the first possibility is unacceptable for Inmon’s data warehouse. Thus, we will only consider the second one. In this case, we can think of each uploaded data set as a data snapshot taken at a certain point in time. The collection then consists of temporally chained data snapshots representing the history of the data involved.

To make the collection time-variant, each snapshot uploaded into the collection has to have a timestamp. This indicates the point in time when the snapshot was generated or when it was uploaded into the collection. Furthermore, the data is not simply stacked along the time axis. It is integrated by applying diverse transformation techniques. For instance, to integrate data, the data has to be joined in diverse ways. Have you paid attention to the time aspect? Do you join data uploaded one week ago with data uploaded just today while taking their business time semantics into account? If yes, are all these joins semantically correct? You might say: “No, I don’t make it so complex. I only upload the source data on a regular basis and do not keep the older data.” Excellent! This is, however, just the first possibility mentioned above, and Inmon’s definition requires the fourth feature: “nonvolatile.”
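
The second possibility, appending timestamped snapshots instead of replacing them, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration only: the names `Snapshot`, `SnapshotStore`, `upload` and `as_of`, the customer-to-city payload, and the choice to stamp each snapshot with its upload date. It is not taken from Inmon or from any particular product.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Snapshot:
    loaded_on: date        # when the snapshot was uploaded into the collection
    rows: Dict[str, str]   # hypothetical payload: customer id -> city


class SnapshotStore:
    """Append-only collection of timestamped snapshots (possibility 2)."""

    def __init__(self) -> None:
        self._snapshots: List[Snapshot] = []

    def upload(self, loaded_on: date, rows: Dict[str, str]) -> None:
        # Older snapshots are never overwritten; the new one is appended.
        if self._snapshots and loaded_on <= self._snapshots[-1].loaded_on:
            raise ValueError("snapshots must be uploaded in time order")
        self._snapshots.append(Snapshot(loaded_on, dict(rows)))

    def as_of(self, day: date) -> Optional[Snapshot]:
        # Latest snapshot uploaded on or before the given day, if any.
        candidates = [s for s in self._snapshots if s.loaded_on <= day]
        return candidates[-1] if candidates else None


store = SnapshotStore()
store.upload(date(2012, 5, 1), {"C1": "Zurich"})
store.upload(date(2012, 5, 8), {"C1": "Bern"})

old = store.as_of(date(2012, 5, 3))    # state of the collection in early May
new = store.as_of(date(2012, 5, 10))   # state after the second upload
```

Because both snapshots are kept, a question about early May is still answered from the first snapshot after the second one has arrived. Joining data across snapshots correctly would additionally require picking, for each side of the join, the snapshot that matches the business time in question, which this sketch does not attempt.
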

  • Nonvolatile: This is perhaps the most challenging requirement, especially when it is considered together with the aspects of “integrated” and “time-variant.” It requires that the information, meaning, or semantics that a data element carried in the source application at the time it was uploaded into the data warehouse remain unchanged and be kept forever, even if its representation in the data warehouse has been changed for integration purposes, or its original has later been changed or removed from the source application. A first, quick reading may be that the data element must not be removed from the data warehouse. But it is much more than that if the data warehouse is to be used effectively. In fact, it means that this data element in the data warehouse has to keep additional information about its validity time interval: from one point in time to another it was valid, and after that it was not valid anymore, because its original was replaced by a corrected one or was removed from the source application. Does your organization have any other infrastructure that has such functionality? Yes, some operational applications may have this functionality for their operational purposes. Does your data warehouse have this functionality everywhere? If yes, congratulations, since this is the validity part of so-called bi-temporality. If not, yours is unfortunately not a data warehouse according to Inmon’s definition! As a matter of fact, there are not many data warehouses in the world that really and completely provide their data with bi-temporality. Thus, you are not alone and not even in the minority.

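
The validity time interval described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Version`, `History`, `record_change`, `record_removal` and `value_as_of` are hypothetical, the interval is treated as half-open (valid from `valid_from` up to but not including `valid_to`), and only the validity part of bi-temporality is covered, not the second time dimension.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Version:
    key: str
    value: str
    valid_from: date
    valid_to: Optional[date] = None   # None means "still valid"


class History:
    """Keeps every version of a data element with its validity interval."""

    def __init__(self) -> None:
        self.versions: List[Version] = []

    def _close_current(self, key: str, day: date) -> None:
        # End the validity of the open version; its value is left untouched.
        for v in self.versions:
            if v.key == key and v.valid_to is None:
                v.valid_to = day

    def record_change(self, key: str, value: str, day: date) -> None:
        # The source replaced the original: close the old version, add a new one.
        self._close_current(key, day)
        self.versions.append(Version(key, value, day))

    def record_removal(self, key: str, day: date) -> None:
        # The source removed the original: the warehouse only ends its validity.
        self._close_current(key, day)

    def value_as_of(self, key: str, day: date) -> Optional[str]:
        for v in self.versions:
            still_open = v.valid_to is None or day < v.valid_to
            if v.key == key and v.valid_from <= day and still_open:
                return v.value
        return None


history = History()
history.record_change("C1", "Zurich", date(2012, 1, 1))
history.record_change("C1", "Bern", date(2012, 3, 1))
history.record_removal("C1", date(2012, 6, 1))
```

In this sketch nothing is ever deleted: after the removal in the source, both versions are still stored, and a query for any day between January and May still returns the value that was valid on that day.
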
In summary, subject orientation is not essential for the definition, and the temporal requirements rule out most aspirants from Inmon’s honorable data warehouse club. This might be the reason why the other data warehouse definitions do not, or do not want to, consider temporality at all.

Inmon’s is the most popular and influential definition for data warehouses. Although it is not perfect as a fundamental definition, as we just reviewed, it enumerates all essential functionalities that a data warehouse should provide for the organization. In my next article, I will suggest a revised data warehouse definition with detailed explanations, based on the discussion here and in my last article.

  • Bin Jiang, Ph.D.
    Dr. Bin Jiang received his master’s degree in Computer Science from the University of Dortmund, Germany, in 1986. In 1992, he received his doctorate in Computer Science from ETH Zurich, Switzerland. During his research period, two of his publications in the field of database management systems were named best student papers at the IEEE Conference on Data Engineering, in 1990 and 1992.

    Afterward, he worked for several major Swiss banks, insurance companies, retailers, and with one of the largest international data warehousing consulting firms as a system engineer, software developer, and application analyst in the early years, and then as a senior data warehouse consultant and architect for almost twenty years.

    Dr. Bin Jiang is a Distinguished Professor at a large university in China, and the author of the book Constructing Data Warehouses with Metadata-driven Generic Operators (DBJ Publishing, July 2011), which Dr. Claudia Imhoff called “a significant feat” and for which Bill Inmon provided a remarkable foreword. Dr. Jiang can be reached by email at bin.jiang@bluewin.ch.

    Editor's Note: You can find more articles from Dr. Bin Jiang and a link to his blog in his BeyeNETWORK expert channel, Data Warehouse Realization.
