As silver bullets go, the "data lake" is a good one. Pitched by big data advocates, the data lake promises to speed the delivery of information and insights to the business community without the hassles imposed by IT-centric data warehousing processes. It almost seems too good to be true.
With a data lake, you simply dump all your data, both structured and unstructured, into the lake (i.e., Hadoop) and then let business people "distill" their own parochial views within it using whatever technology is best suited to the task (e.g., SQL or NoSQL, disk-based or in-memory databases, MPP or SMP). And you create enterprise views by compiling and aggregating data from multiple local views. The mantra of the data lake is think global, act local. Not bad!
Data Lake Benefits. Assuming this approach works, there are many benefits. First, the data lake gives business users immediate access to all data. They don't have to wait for the data warehousing (DW) team to model the data or give them access. Rather, they shape the data however they want to meet local requirements. The data lake speeds delivery and offers unparalleled flexibility since nothing and no one stands between business users and the data.
Second, data in the lake is not limited to relational or transactional data--the traditional fare served by data warehouses. The data lake can contain any type of data: clickstream, machine-generated, social media, and external data, and even audio, video, and text. It's a proverbial cornucopia of data delights for the data digerati.
Third, with a data lake, you never need to move the data. And that's important in the era of big data. The data streams into the lake and stays there. You process it in place using whatever technology you want and serve it up however users want. But the data never leaves the lake. It's one big body of water with many different fishing spots, one for every type of sportsman.
So, there is a lot to like about the data lake: it empowers business users, liberating them from the bonds of IT domination; it speeds delivery, enabling business units to stand up applications quickly; and it ushers in new types of data and technology that lower the costs of data processing while improving performance. So what's the problem?
Alligators in the Swamp
Uncharted territory. Although big data advocates are quite adept at promoting their stuff (and even better at bashing the data warehouse), they never tell you about the alligators in the swamp. Since very few companies have actually implemented a data lake, perhaps no one has seen the creatures yet. But they are there. In fact, the first razor-toothed reptile that should cause your adrenalin to surge is the fact that the data lake is uncharted water. This stuff is so new that only real risk-takers are willing to swim in the swamp.
Expensive. The risk, however, presents a great sales opportunity. Product and services vendors are more than willing to help you reap the benefits of the data lake, while minimizing the risk. Don't have Hadoop, MPP, in-memory engines, or SQL-on-Hadoop tools or any experience managing them? No problem, we can sell and implement those technologies for you. Don't know how to distill local and enterprise views from the lake? No worries, our consultants can help you architect, design, and even manage the lake for you. All you need to take a swim is money, and lots of it! That's the second danger: the threat to your budget.
Data governance. The biggest peril, however, is the subtle message that it's easy to create any view you want in the data lake. Proponents make it seem like the data lake's water has magical properties that automatically build local and enterprise views. But diving into the details, you discover that the data lake depends on a comprehensive master data management (MDM) program. Before you can build views, you need to define and manage core metrics and dimensions, ideally in a consistent way across the enterprise. You then link together virtual tables using these entities to create the local or enterprise views you want.
The problem with this approach is that MDM is hard. It's hard for the same reason data warehousing is hard. Defining core entities is a business task that is fraught with politics. No one agrees how to define basic terms like "customer" or "sale". Therefore, the temptation is to simply build local solutions with local definitions. This meets the immediate need but does nothing to create a common view of the enterprise that executives need to run the business. An organization with lots of local views but no corporate view is like a chicken with its head cut off: it dashes madly here and there until it suddenly drops dead.
Of course, if your organization has invested in MDM, then building enterprise views in a data lake is easy. But the same is true of a data warehouse. When an MDM solution assumes the burden of reconciling business entities, then building a data warehouse is a swim in the lake, so to speak.
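To make the dependence on MDM concrete, here is a minimal sketch in Python. Everything in it is hypothetical and for illustration only: the two local views (`web_sales`, `store_sales`), the crosswalk `master_customer`, and the function `enterprise_view` are assumptions, not part of any real product. It shows that an enterprise total per customer can be built only once each local identifier has been mapped to an agreed master ID.

```python
from collections import defaultdict

# Hypothetical local views: each business unit names "customer" its own way.
web_sales = [
    {"customer": "acme-inc", "amount": 120.0},
    {"customer": "globex", "amount": 75.0},
]
store_sales = [
    {"customer": "ACME Incorporated", "amount": 200.0},
    {"customer": "Initech", "amount": 50.0},
]

# Hypothetical MDM crosswalk: local identifier -> agreed master customer ID.
# Agreeing on this mapping is the hard, political part; the code is trivial.
master_customer = {
    "acme-inc": "C001",
    "ACME Incorporated": "C001",
    "globex": "C002",
    "Initech": "C003",
}


def enterprise_view(local_views, crosswalk):
    """Sum sales per master customer ID and collect rows that cannot be mapped."""
    totals = defaultdict(float)
    unmapped = []
    for view in local_views:
        for row in view:
            master_id = crosswalk.get(row["customer"])
            if master_id is None:
                # No agreed definition exists for this customer yet.
                unmapped.append(row)
            else:
                totals[master_id] += row["amount"]
    return dict(totals), unmapped


totals, unmapped = enterprise_view([web_sales, store_sales], master_customer)
print(totals)
print(unmapped)
```

Without the crosswalk, "acme-inc" and "ACME Incorporated" would be counted as two customers, and every row would land in `unmapped`. The aggregation takes a few lines; the reconciliation it relies on is the work MDM has to do first.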
Let's be honest: the data lake is geared to power users who want and need immediate access to all data as well as business units that want to build their own data-driven solutions quickly without corporate IT involvement. These are real needs and the data lake offers a wonderful way to address them.
But please don't believe that a data lake is going to easily give you enterprise views of your organization populated with clean, consistent, integrated data. Unless you have a full-fledged MDM environment and astute data architects, the data lake isn't going to digitally unify your organization. And that will disappoint the CEO and CFO. To make the data lake work for everyone requires a comprehensive data governance program, something that few organizations have implemented and even fewer have deployed successfully.
Ultimately, the data lake is a response to and antidote for the repressive data culture that exists in many companies. We've given too much power to the control freaks (i.e. IT architects) who feel the need to normalize, model, and secure every piece of data that comes into the organization. Even data warehousing professionals feel this way; they have developed and evangelized more agile, flexible approaches to deliver information to the masses.
Frankly, the data lake courts chaos. And that's fine. We need a measure of data chaos to keep the data nazis in check. The real problem with the data lake is that there are no warning signs to caution unsuspecting business people about the dangers lurking in its waters.
In subsequent posts, I'll debunk the criticisms of the data warehouse by the data lakers and present a new reference architecture (actually an ecosystem) that shows how to blend the data lake and more traditional approaches into a happy, harmonious whole.