Forbes magazine defined data lake by comparing it to a data warehouse: "The difference between a data lake and a data warehouse is that the data in a data warehouse is pre-categorized at the point of entry, which can dictate how it's going to be analyzed." Data hub is a term used by various big data vendors. In a recent tweet, Clarke Patterson summarized it as follows: data lake = storage + integration; and data hub = data lake + exploration + discovery + analytics. A data reservoir is a well-managed data lake.
All three architectures have in common that data from many sources is copied to one centralized data storage environment. After it has been stored in the data lake (or hub or reservoir), it's available for all business purposes such as integration and analysis. The preferred technology for developing all three is Hadoop. The reason that Hadoop is selected is, firstly, because its low price/performance ratio makes it attractive for developing large scale environments, and secondly, it allows raw data to be stored in its original form and not in a form that already limits how we can use the data later on.
In principle, there is nothing wrong with idea of making all data available in a way that leaves open any form of use, but is moving all the data to one physical data storage environment implemented with Hadoop the smartest alternative?
We have to remember that Hadoop's primary strength and from which it derives its processing scalability, is that it pushes processing to the data, and not vice versa as most other platforms do. Because it's not pushing the data to the processing, Hadoop is able to offer high levels of storage and processing scalability.
Now, isn't this inconsistent with the data lake, data hub, and data reservoir architectures? On one hand we have, for example, the data lake in which all the data is pushed to a centralized data storage environment for processing, and on the other hand, we deploy Hadoop technology, which shines when processing is pushed to the data.
Shouldn't we come up with architectures that are more in line with the fundamental design concept of Hadoop itself, namely architectures where processing (integration and analysis) is pushed to the systems where the data originates (where the data is entered and collected)?
What is needed are systems that make applications and users think that all data is centralized and that all can be freely analyzed. This also avoids massive and continuous data movement exercises and avoids duplicate storage of data. In other words, we need virtual data lakes, virtual data hubs and virtual data reservoirs. Such architectures would be much more in line with Hadoop, which is based on pushing processing to the data. Maybe we should open this discussion first before we recommend organizations to invest heavily in massive data lakes in which they may drown.
Posted May 19, 2014 12:48 AM
Permalink | No Comments |