

The 2-Month Data Model

Originally published September 3, 2009

Once upon a time, there were two companies in the same line of business – company A and company B. Both companies decided that they needed a data model for future development, and both set out to build one. Company A gave its consultants 2 months to produce the model. Company B gave its consultants a year.

After both models were completed, a comparison was made. It was true that company B’s data model was a bit more detailed than company A’s. But the remarkable thing was that the models were very similar, even though one had taken much longer to build than the other. Was the extra ten months of time and money spent by company B on its data model worth it? What is going on here?

One of the interesting facets of data modeling is that building a data model consumes whatever resources you throw at it. If you give an organization 2 months to get a data model done, it will get approximately the same results as a company that spends much longer. Data model development expands to fill whatever resources it is allocated.

But there are other interesting and mitigating characteristics related to the length of time required to construct a data model. One of those characteristics is that a data model – properly built – operates at the lowest level of granularity. The most atomic data that a corporation has is the data that is most fit for inclusion in the data model. Conversely, the more summarized and aggregated that corporate data is, the less the data fits the data model. There is a good reason why atomic data belongs in the data model and summarized, aggregated data does not: the more summarized data becomes, the less stable it is.

The epitome of unstable data is data found on a spreadsheet. Data on a spreadsheet can be changed on a minute-by-minute or even second-by-second basis. If a data modeler tries to keep up with the developer of the spreadsheet, the data modeler will lose; the modeler simply cannot keep up with the rate at which the spreadsheet developer changes the elements of data. It is only the basic, atomic data elements that remain constant. Therefore, it behooves the data modeler to focus on the most stable data as the data that belongs in a data model.

Another way of looking at the data model as a collection of atomic data is that if the atomic data is gathered and organized properly into the data model, then the summarized or aggregated data that stems from the atomic data can be calculated at a later point in time.
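As a rough illustration of this point (the code, table and field names below are hypothetical, not taken from the article), atomic transaction records can be stored exactly as they occur, and a summary can be derived from them whenever it is needed. A minimal Python sketch:

    from dataclasses import dataclass
    from datetime import date
    from collections import defaultdict

    @dataclass(frozen=True)
    class Transaction:
        # One atomic fact: a single transaction, captured as a by-product of execution.
        account_id: str
        tx_date: date
        amount: float

    def monthly_totals(transactions):
        # Summarized data computed on demand -- it never has to be modeled up front.
        totals = defaultdict(float)
        for tx in transactions:
            totals[(tx.account_id, tx.tx_date.strftime("%Y-%m"))] += tx.amount
        return dict(totals)

    txs = [
        Transaction("A-100", date(2009, 9, 1), 250.00),
        Transaction("A-100", date(2009, 9, 15), -40.00),
        Transaction("B-200", date(2009, 9, 2), 75.50),
    ]
    print(monthly_totals(txs))
    # {('A-100', '2009-09'): 210.0, ('B-200', '2009-09'): 75.5}

The atomic records stay stable; the monthly summary is just a calculation that can be rerun, or redefined, at any later point in time.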

When the data modeler allows summarized or aggregated data into the data model, the process of data modeling does not ever finish. The data model needs to be changed every time the end user changes his/her mind about how summarized data is to be calculated. And the end user is constantly changing his/her mind as to the algorithms for the creation of summarized and aggregated data.

Another complicating element of trying to model summarized data in a data model is that of the algorithm used for the calculation or aggregation of the summarized data. With atomic data, there usually is no algorithm needed for calculation, as most atomic data is merely gathered into the system, usually as a by-product of the execution of a transaction. But when it comes to summarized data, there is ALWAYS an algorithm associated with the calculation of data. Furthermore, that algorithm is subject to change at a moment’s notice.

Because summarized and aggregated data always has an algorithm associated with it, and because that algorithm is subject to change, the data modeler is faced with keeping track of the algorithmic changes over time. In other words, if an analyst is looking at summarized data that has been modeled, it is not sufficient to merely understand the algorithm associated with that data. The analyst must understand the specific version of the algorithm that was in use at the moment the summarization was made.
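A small, hypothetical Python sketch of the problem (all names below are illustrative, not from the article): a stored summary is only interpretable if the record also carries the version of the algorithm that produced it.

    from dataclasses import dataclass
    from typing import Callable, Dict, List

    # Two versions of the same "revenue" summary algorithm -- the business changed its mind.
    def revenue_v1(amounts: List[float]) -> float:
        return sum(amounts)                      # v1: gross total, credits included

    def revenue_v2(amounts: List[float]) -> float:
        return sum(a for a in amounts if a > 0)  # v2: credits excluded

    ALGORITHM_VERSIONS: Dict[str, Callable[[List[float]], float]] = {
        "v1": revenue_v1,
        "v2": revenue_v2,
    }

    @dataclass(frozen=True)
    class SummaryRecord:
        # A stored summary is meaningless without the version that computed it.
        period: str
        value: float
        algorithm_version: str

    def explain(record: SummaryRecord) -> str:
        fn = ALGORITHM_VERSIONS[record.algorithm_version]
        return f"{record.period}: {record.value} was computed by {fn.__name__}"

    print(explain(SummaryRecord("2009-09", 210.0, "v1")))

Every change of mind by the end user adds another entry to the version registry, and the model must carry that history forever.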

In short, putting summarized data into a data model and tracking the algorithms used to calculate it introduces an order of complexity that is simply undesirable. Introducing summarized data into a data model means that the data modeling process will be complicated and never-ending. In the worst case, the rate and propensity of change of the summarized data means that the data model will never be finished.

But there is another dimension to the world of data modeling that greatly affects the speed with which the data model can be created. That dimension is the use of “generic” data models.

To explain the value of generic data models, consider the experience of many consultants. One day, a consultant goes to work for a bank as a data modeler. The consultant builds a data model for the bank as part of his/her job, finishes it, and then finds another job as a data modeler at another bank. The consultant then builds the data model for the second bank. Upon finishing the second data model, the consultant makes a sharp observation: the data model produced for the second bank is almost a carbon copy of the data model created for the first. The consultant has just discovered that the data models created for two companies inside the same industry are very, very similar. Now, if the consultant had first created a banking data model and then created a data model for a manufacturing environment, it is very likely that the two data models would bear very little resemblance to each other. But within the same industry, there is tremendous overlap from one model to the next.

And because models are so similar from one company to another within the same industry, there arises the idea of the generic data model. A generic data model is a data model created for an industry, not a specific company. When it comes to building a data model quickly, generic data models save a huge amount of time and money.
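For illustration only (the entities and names below are hypothetical and not drawn from any vendor's model), a fragment of a generic banking model might define industry-level entities that any bank shares, which a specific bank then extends with its own local details:

    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class Party:
        # Generic, industry-level entity: any person or organization the bank deals with.
        party_id: str
        name: str

    @dataclass
    class Account:
        # Generic, industry-level entity: an account held by a party -- common to virtually every bank.
        account_id: str
        owner: Party
        opened_on: date
        account_type: str  # e.g. "checking", "savings", "loan"

    @dataclass
    class FirstBankAccount(Account):
        # Bank-specific extension: only the local details still need to be modeled.
        branch_code: str = ""
        overdraft_limit: float = 0.0

    acct = FirstBankAccount("AC-1", Party("P-1", "Jane Doe"), date(2009, 9, 3), "checking", "BR-42", 500.0)
    print(acct.owner.name, acct.branch_code)

Starting from the shared industry-level entities, the modeler only has to fill in the company-specific pieces, which is where the time and money savings come from.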

From these simple examples, it can be seen that there are indeed shortcuts that can be taken in creating a data model. A data model does not have to cost a huge amount of money, and it does not have to consume a huge amount of time to construct.

Bill Inmon

    Bill is universally recognized as the father of the data warehouse. He has more than 36 years of database technology management experience and data warehouse design expertise. He has published more than 40 books and 1,000 articles on data warehousing and data management, and his books have been translated into nine languages. He is known globally for his data warehouse development seminars and has been a keynote speaker for many major computing associations.



Comments


Posted November 4, 2009 by korhan yunak

Thanks for the article, Bill. Would it be possible to give a couple of vendors that should be seriously considered when choosing an industry/generic data model? A few that I can identify include ADRM Software, Oracle, Teradata, IBM, Universal Data Models LLC, and HP. Are there any other vendors in this area? And what do you think would be a good approach/method/criteria for choosing an industry/generic data model, from your point of view?

thanks

Korhan Yunak
