The other day at a conference I heard a conversation that went something like this: “Now that everyone has big data, we don’t need data warehousing anymore.” The speaker was a vendor of big data technology talking to a prospect. The remark set my mental wheels in motion. Maybe you don’t need a data warehouse if you have big data. Indeed, there are some similarities between a data warehouse and a big data solution. Both hold a lot of data. Both can be used for reporting. Both reside on electronic storage devices. So why aren’t they interchangeable? If I buy a big data solution, doesn’t that obviate the need for a data warehouse?
What is Big Data?
To examine the truth (or lack thereof) in this line of thinking, we need to start with the basics. First, what is big data? There are actually many different forms of big data. But the most widely understood form is the one found in Hadoop and in the offerings of vendors such as Cloudera.
A good working definition of big data solutions is:
- Technology capable of holding very large amounts of data.
- Technology that can hold the data in inexpensive storage devices.
- Technology where processing is done by the “Roman census” method.
- Technology where the data is stored in an unstructured format.
There are probably other ramifications and features, but these basic characteristics are a good working description of what most people mean when they talk about a big data solution. (To verify this working definition, refer to the websites of Cloudera or Hortonworks.)
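The “Roman census” item in the list above may be the least familiar one. A minimal sketch of the idea follows, assuming the usual reading of the term: the processing is sent out to wherever the data is held, and only small summaries travel back to be combined. The node names, the records and both function names are hypothetical, and a plain loop stands in for what a real big data platform would run in parallel across machines.

```python
from collections import Counter

# Hypothetical data: each "node" holds its own slice of the raw records.
NODES = {
    "node_a": ["rome", "ostia", "rome"],
    "node_b": ["capua", "rome"],
    "node_c": ["ostia", "capua", "capua"],
}


def count_locally(records):
    """Runs where the data lives and returns only a small summary."""
    return Counter(records)


def roman_census(nodes):
    """Send the counting step to every node, then merge the summaries."""
    total = Counter()
    for records in nodes.values():
        # Only the per-node counts move; the raw records stay put.
        total += count_locally(records)
    return total


totals = roman_census(NODES)
```

The point of the design is that the raw records never have to be gathered in one place, which is what lets this style of processing work on very large volumes of data held on inexpensive storage.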
What is a Data Warehouse?
Just as there are different interpretations of what is meant by big data, there are different interpretations of what is meant by data warehousing. The two principal ones are the Kimball approach and the Inmon approach. This article discusses the Inmon approach, which centers on a definition of a data warehouse given many years ago: a data warehouse is a subject-oriented, nonvolatile, integrated, time-variant collection of data created for the purpose of management’s decision making. Another way of saying the same thing is that a data warehouse provides a “single version of the truth” for decision making in the corporation. With a data warehouse there is an integrated, granular, historical single point of reference for data in the corporation.
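The four properties in that definition can be made concrete with a toy example. This is only an illustrative sketch, not a description of how any real data warehouse is built: the class names, the two source systems ("billing" and "crm"), their field names and their status codes are all invented for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CustomerSnapshot:
    customer_id: str
    status: str
    as_of: date  # time-variant: every row is stamped with a moment in time


class CustomerWarehouse:
    """Subject-oriented: everything held here is about the customer subject."""

    def __init__(self):
        self._rows = []

    def load(self, source_row, source_system, as_of):
        # Integrated: each source's keys and codes are mapped to one format.
        if source_system == "billing":
            customer_id = source_row["cust_no"]
            status = source_row["state"].lower()
        elif source_system == "crm":
            customer_id = source_row["id"]
            status = {"A": "active", "I": "inactive"}[source_row["flag"]]
        else:
            raise ValueError(f"unknown source system: {source_system}")
        # Nonvolatile: rows are appended, never updated or deleted.
        self._rows.append(CustomerSnapshot(customer_id, status, as_of))

    def status_as_of(self, customer_id, when):
        """Return the latest known status on or before `when`, or None."""
        candidates = [
            row for row in self._rows
            if row.customer_id == customer_id and row.as_of <= when
        ]
        if not candidates:
            return None
        return max(candidates, key=lambda row: row.as_of).status


warehouse = CustomerWarehouse()
warehouse.load({"cust_no": "C1", "state": "ACTIVE"}, "billing", date(2020, 1, 1))
warehouse.load({"id": "C1", "flag": "I"}, "crm", date(2021, 6, 1))
```

Because history is kept rather than overwritten, anyone who asks for the status of customer "C1" as of a given date gets the same answer, which is the “single version of the truth” in miniature.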
So why do people want a big data solution? People want a big data solution because in a lot of corporations there is a lot of data. And in those corporations that data – if unlocked properly – can contain much valuable information that can lead to better decisions that, in turn, can lead to more revenue, more profitability and more customers. And that is what most corporations want.
And why do people need a data warehouse? People need a data warehouse in order to make informed decisions. In order to really know what is going on in your corporation, you need data that is reliable, believable and accessible to everyone.
Comparing Big Data Solutions to a Data Warehouse
So when we compare a big data solution to a data warehouse, what do we find? We find that a big data solution is a technology and that data warehousing is an architecture. They are two very different things. A technology is just that – a means to store and manage large amounts of data. A data warehouse is a way of organizing data so that there is corporate credibility and integrity. When someone takes data from a data warehouse, that person knows that other people are using the same data for other purposes. There is a basis for reconcilability of data when there is a data warehouse.
The difference between a technology and an architecture is the difference between hammers and nails and Santa Fe, New Mexico. Hammers and nails can be used to build many different things. You can build houses, tables, bridges, desks and many things with hammers and nails. The houses in Santa Fe are all of a distinctive architecture. In Santa Fe you find adobe, exposed beams and vigas. When you are in Santa Fe, you know that you are nowhere else. Santa Fe has its own architecture. And it is true that the homes and buildings in Santa Fe have been built from hammers and nails. But go to Santa Fe, and the difference between a technology and an architecture will be very clear to you.
Another Way to Look at the Issue
Looking at this another way, can an organization have a big data solution and not have a data warehouse? Yes, it can. Can an organization have a big data solution and have a data warehouse? Yes, it can. Can an organization have a data warehouse and not have a big data solution? Yes, it can. Can an organization have a data warehouse and have a big data solution? Yes, it can.
There is no dependency, then, between a big data solution and a data warehouse. Having one neither requires nor rules out the other. They are not the same thing.
Revisiting the Question
With all of that in mind, let’s go back and examine the question that we started with – if you have big data, do you need a data warehouse? The answer is that as long as your corporation has a need for reliable, believable and accessible data that everyone in the corporation can rely on, then you need a data warehouse. Having big data is neither here nor there when it comes to needing a data warehouse.
So why would a vendor try to tell someone that an installation of a big data solution is a replacement for a data warehouse? Perhaps the vendor doesn’t understand what a data warehouse is. Or perhaps the vendor just wants to make a sale and doesn’t really care what has to be said to make it.
A big data solution is not a replacement for a data warehouse.