Over the last year, vendors have been mentioning (alluding to, to be more precise) customers who are building (and maybe even operating) data warehouses containing more than a petabyte of data. An analyst whom I follow, Curt Monash, recently wrote an article remarking that the "petabyte barrier is crumbling" and that "the 100-terabyte mark is almost old hat".
I sensed the same. However, I have a concern with this trend toward bigger...and hence better.
Do we really need all that data? Can we really manage all that data? Does all that data really result in meaningful business value? Can we, as finite human beings, really comprehend all that data?
I am not concerned about the technical side of the equation. Over time, the storage devices, I/O data channels, MPP grids, parallel query optimizers, and so forth will evolve to handle petabytes and more.
I am concerned about the soft side of the equation. It is a 'simple' matter of information complexity!
As an illustration, if we add more and more rows to a customer table, do we add more value that the business could potentially realize? Likewise, if we add more and more columns to a customer table, do we add more value that the business could potentially realize? If we add more and more tables to the data warehouse, do we add more value that the business could potentially realize? By the phrase 'more and more' I am implying several orders of magnitude more!
As IT professionals, we need to take a step (or many steps) backward and look at the entire end-to-end process. More information is only better information when the organization can assimilate the information and change its behavior to improve business performance. If the organization cannot assimilate the information, then more information is not better. If the organization does not change its behavior, then more information is irrelevant. If the organization does not improve its performance, then more information has no value or even a negative value.
Therefore, I conclude that petabyte data warehouses will be a liability for most organizations. It will be like giving a hot sports car to a teenage driver who cannot handle the extreme capability. At best, it will be a huge waste of resources. At worst, it may result in the demise of those involved. There will be exceptions. I would like to talk with those who survive their petabyte experience.