We use cookies and other similar technologies (Cookies) to enhance your experience and to provide you with relevant content and ads. By using our website, you are agreeing to the use of Cookies. You can change your settings at any time. Cookie Policy.


Different Kinds of Data Warehouses

Originally published May 14, 2009

For years, the world has built classical data warehouses. A classical data warehouse is one that is built upon the foundation of structured data. Structured data is the type of data where there is repeatability. The same type of data is repeated over and over again. Typically, structured data comes from transactions. As each transaction is executed, the same type of data appears over and over. The only thing that is different from one type of data to the next is the actual value.

Typical of structured data are bank transactions – checks, ATM activities, withdrawals, and so forth. These activities are handled many times a day by a bank. And there are many other kinds of structured activities that occur – airline reservations, insurance claims settlements, manufacturing control, inventory control, and so forth. It is fair to say that structured processing represents the day-to-day, everyday activities of the corporation, usually at the clerical level.

And it is also fair to say that analytical processing has occurred against this type of data almost exclusively.

However, DW2.0 has entered the world, and with DW2.0 has come unstructured data. One of the novel aspects of DW2.0 is the notion that unstructured data belongs in a data warehouse, right along with structured data. Central to the processing for DW2.0 is the notion that textual data must first be integrated before it belongs in a data warehouse. Textual integration prepares the text for textual analytics. Textual integration calls for many types of edits to be made to the text before the text is fit to be placed into a data warehouse. Among other things, textual integration calls for such basic text manipulations as:

  • Standardization of dates,

  • Standardization of other measurements of data,

  • Translation of text to numeric values when appropriate,

  • Removal of stop words, and

  • Stemming of words.

There are many different kinds of edits that text must pass through in order for if to be prepared for analytical processing. Stated differently, if text is not subjected to the rigorous processing found in textual ETL, then the scanning and search of the text produces questionable results. The best search engine in the world run against garbage produces garbage. GIGO applies to text as well as it applies to any other form of data.

Once the text is subjected to the rigors of textual ETL, the text is fit to enter a data warehouse. The question then becomes – what kind of data warehouse will the text be placed into?

There are two basic kinds of data warehouses for text that has been passed through the textual ETL process. These types of data warehouses are:

  • An unstructured data warehouse, where the data warehouse contains only unstructured data, or

  • A data warehouse that consists of both structured data and unstructured data.

Both of these two types of data warehouses are possibilities. Both have their place and neither is right or wrong.

An unstructured data warehouse is one in which there is only unstructured data. In an unstructured data warehouse, there is no data whose origin is structured transactions. In some regards, data in an unstructured data warehouse is simplicity itself. Data in an unstructured data warehouse exists in two basic forms – it is either indexed from a document or data is created by looking at named variables and patterns. Or both. In an unstructured data warehouse, both types of data may exist. The analytical processing that can occur in an unstructured data warehouse is processing that can go against either of these two types of data.

One of the more sophisticated types of queries that is processed in an unstructured data warehouse is the query that uses both types of data simultaneously, as the product of a join.

The second type of data warehouse that can be produced is a data warehouse that contains data that comes from both a structured origin and an unstructured origin. This kind of data warehouse contains its unstructured data, from either a simple indexing of a document or from a named or patterned indexing of data, just like an unstructured data warehouse. But in an integrated data warehouse – where there is data whose origin is both unstructured and structured – there is another type of data. That data is data that represents a cross between structured and unstructured data.

In an integrated data warehouse, the data is stored as a relational table. In that regard, there is no difference between the structures of data whose origins are structured or unstructured. But how the data arrived in the relational format and the meaning of the data could not be more different.

Structured data arrived in a table as a result of a modeling process that began as an abstraction of business and business requirements. After the requirements were shaped, the physical design of the database was done. Then data was populated into the table whose origins were structured data.

Data arrived in the unstructured portion of the database in an almost opposite manner. First, the unstructured data was examined. Abstractions that existed in the unstructured data were identified and classes of textual data were created.

At the end of the day, a relationship between structured data and unstructured data is made based on the abstractions that have been identified, albeit identified in very different ways.

Once the abstractions have been created for both unstructured data and structured data, those abstractions are used for linkage between the two worlds. The power of an integrated database that contains both structured and unstructured data depends on the ability to link together both types of data in the analytical process.

  • Bill InmonBill Inmon

    Bill is universally recognized as the father of the data warehouse. He has more than 36 years of database technology management experience and data warehouse design expertise. He has published more than 40 books and 1,000 articles on data warehousing and data management, and his books have been translated into nine languages. He is known globally for his data warehouse development seminars and has been a keynote speaker for many major computing associations.

    Editor's Note: More articles, resources and events are available in Bill's BeyeNETWORK Expert Channel. Be sure to visit today!

Recent articles by Bill Inmon

 

Comments

Want to post a comment? Login or become a member today!

Be the first to comment!