Data Warehousing: How Much Data?

Originally published May 26, 2011

Talk to any self-respecting data warehouse developer and you get a story about a lot of data. Data warehouses are large because they contain a combination of historical data, detailed data, and a wide variety of types of data. There is an equation:

History x detail x variety = lots of data

And with lots of data come some predictable problems. Loading data is an issue. Reorganizing data is an issue. Indexing data is an issue. Accessing data efficiently is an issue. Finding and managing dormant data is an issue. The budget required for a data warehouse is an issue.  The technology required to manage ever-increasing volumes of data is an issue. In short, keeping up with data in a data warehouse is a challenge unto itself.

Now DW 2.0 comes along. And with DW 2.0 it is innocently suggested that we start to grapple with unstructured, textual data in a data warehouse. Let’s do a quick calculation. It is estimated that there is 5 to 10 times as much textual data in the corporation as there is classical structured data. And in nearly every corporation, the data warehouse is made up exclusively of structured information. Today we already have a challenge managing the structured data in our corporations. Once we start to add unstructured data to our data warehouses, it is going to make them up to TEN times as large as what we have today. Is that what is being said here?

With the volumes of data that are coming our way with unstructured data, storage needs are going to get larger. But how much larger?

While it is true that there is a lot of unstructured data in the world, is ALL of it going to be placed on disk storage? Certainly a lot of it will be, but not ALL of it.

So what part of unstructured data will not/should not find its way to disk storage? There is a lot of weeding out to be done:
  • There are essentially three types of email – personal email, spam, and business-related email. Only business-related email should find its way to a data warehouse. Spam and personal email should be weeded out.

  • Stop words need to be filtered out. In some languages, stop words take up to 40% of the text.

  • Some unstructured processing needs fractured documents. Other types of unstructured processing need only selective indexing. Fractured documents take up much more space than documents that are selectively indexed.

  • Some documents need only their metadata referenced. It is far more efficient to index document metadata than it is to index the contents of those documents, and so forth.

Another factor is that the unstructured data will not find its way into a data warehouse all at once. It is going to take some number of years to factor in all the unstructured data that belongs in a data warehouse.
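
As an illustration of one of the weeding-out steps above, here is a minimal sketch of stop word filtering. The stop word list, the function names, and the sample sentence are all hypothetical and chosen only for the example; a real list is language-specific and much longer, and real text would need punctuation handling as well.

```python
# Hypothetical, deliberately tiny stop word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "that"}


def remove_stop_words(text: str) -> list[str]:
    """Return the lowercased words of `text` that are not stop words.

    Words are split on whitespace only; punctuation is not stripped.
    """
    return [word for word in text.lower().split() if word not in STOP_WORDS]


def reduction_ratio(text: str) -> float:
    """Return the fraction of words removed by stop word filtering."""
    words = text.split()
    if not words:
        return 0.0
    return 1 - len(remove_stop_words(text)) / len(words)


sample = "the budget of the data warehouse is an issue"
kept = remove_stop_words(sample)
print(kept)
print(f"{reduction_ratio(sample):.0%} of the words were stop words")
```

How much is saved depends entirely on the text and on the stop word list, so the ratio for this one sentence says nothing about a real corpus.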

So not all unstructured data needs to find its way into a data warehouse, and certainly not all at once. But a lot of unstructured data will eventually find its way into a data warehouse. If one were to make an educated guess, taking the volume of today’s structured data warehouse and multiplying it by a factor of three or four would probably be a good one. And that’s a lot of data in anyone’s book.
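
The arithmetic behind this kind of guess can be sketched as follows. This is only a rough model under stated assumptions: the function name, the 100 TB starting volume, and the retained fractions are hypothetical values picked for illustration, and the sketch reads "a factor of three or four" as the total size of the warehouse relative to today's structured volume.

```python
def estimate_total_volume(structured_tb: float,
                          unstructured_ratio: float,
                          retained_fraction: float) -> float:
    """Estimate total warehouse volume after adding unstructured data.

    structured_tb      -- today's structured warehouse volume (assumed input)
    unstructured_ratio -- unstructured data as a multiple of structured data
                          (the article cites an estimate of 5 to 10)
    retained_fraction  -- share of unstructured data kept after weeding out
                          (a hypothetical parameter, not from the article)
    """
    if not 0.0 <= retained_fraction <= 1.0:
        raise ValueError("retained_fraction must be between 0 and 1")
    unstructured_tb = structured_tb * unstructured_ratio
    return structured_tb + unstructured_tb * retained_fraction


structured = 100.0  # hypothetical starting volume in TB

# Nothing weeded out, at the high end of the 5 to 10 estimate.
everything = estimate_total_volume(structured, 10.0, 1.0)

# Assumed retention rates that land near the "three or four" guess.
low_guess = estimate_total_volume(structured, 5.0, 0.4)
high_guess = estimate_total_volume(structured, 10.0, 0.3)

for label, total in [("keep everything", everything),
                     ("low guess", low_guess),
                     ("high guess", high_guess)]:
    print(f"{label}: {total:.0f} TB ({total / structured:.1f}x today's volume)")
```

The retained fractions here were chosen to match the guess, not measured; in practice they would come from how much email, stop word text, and document content is actually filtered out.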


  • Bill Inmon

    Bill is universally recognized as the father of the data warehouse. He has more than 36 years of database technology management experience and data warehouse design expertise. He has published more than 40 books and 1,000 articles on data warehousing and data management, and his books have been translated into nine languages. He is known globally for his data warehouse development seminars and has been a keynote speaker for many major computing associations.
