
Blog: Wayne Eckerson

Wayne Eckerson

Welcome to Wayne's World, my blog that illuminates the latest thinking about how to deliver insights from business data and celebrates out-of-the-box thinkers and doers in the business intelligence (BI), performance management and data warehousing (DW) fields. Tune in here if you want to keep abreast of the latest trends, techniques, and technologies in this dynamic industry.

About the author

Wayne has been a thought leader in the business intelligence field since the early 1990s. He has conducted numerous research studies and is a noted speaker, blogger, and consultant. He is the author of two widely read books: Performance Dashboards: Measuring, Monitoring, and Managing Your Business (2005, 2010) and The Secrets of Analytical Leaders: Insights from Information Insiders (2012).

Wayne is founder and principal consultant at Eckerson Group, a research and consulting company focused on business intelligence, analytics and big data.

As silver bullets go, the "data lake" is a good one. Pitched by big data advocates, the data lake promises to speed the delivery of information and insights to the business community without the hassles imposed by IT-centric data warehousing processes. It almost seems too good to be true.

With a data lake, you simply dump all your data, both structured and unstructured, into the lake (i.e. Hadoop) and then let business people "distill" their own parochial views within it using whatever technology is best suited to the task (e.g. SQL or NoSQL, disk-based or in-memory databases, MPP or SMP). And you create enterprise views by compiling and aggregating data from multiple local views. The mantra of the data lake is think global, act local. Not bad!

Data Lake Benefits. Assuming this approach works, there are many benefits. First, the data lake gives business users immediate access to all data. They don't have to wait for the data warehousing (DW) team to model the data or give them access. Rather, they shape the data however they want to meet local requirements. The data lake speeds delivery and offers unparalleled flexibility since nothing and no one stands between business users and the data.

Second, data in the lake is not limited to relational or transactional data--the traditional fare served by data warehouses. The data lake can contain any type of data: clickstream, machine-generated, social media, and external data, and even audio, video, and text. It's a proverbial cornucopia of data delights for the data digerati.

Third, with a data lake, you never need to move the data. And that's important in the era of big data. The data streams into the lake and stays there. You process it in place using whatever technology you want and serve it up however users want. But the data never leaves the lake. It's one big body of water with many different fishing spots, one for every type of sportsman.

So, there is a lot to like about the data lake: it empowers business users, liberating them from the bonds of IT domination; it speeds delivery, enabling business units to stand up applications quickly; and it ushers in new types of data and technology that lower the costs of data processing while improving performance. So what's the problem?

Alligators in the Swamp

Uncharted territory. Although big data advocates are quite adept at promoting their stuff (and even better at bashing the data warehouse), they never tell you about the alligators in the swamp. Since very few companies have actually implemented a data lake, perhaps no one has seen the creatures yet. But they are there. In fact, the first razor-toothed reptile that should cause your adrenalin to surge is the fact that the data lake is uncharted water. This stuff is so new that only real risk-takers are willing to swim in the swamp.

Expensive. The risk, however, presents a great sales opportunity. Product and services vendors are more than willing to help you reap the benefits of the data lake, while minimizing the risk. Don't have Hadoop, MPP, in-memory engines, or SQL-on-Hadoop tools or any experience managing them? No problem, we can sell and implement those technologies for you. Don't know how to distill local and enterprise views from the lake? No worries, our consultants can help you architect, design, and even manage the lake for you. All you need to take a swim is money, and lots of it! That's the second danger: the threat to your budget.

Data governance. The biggest peril, however, is the subtle message that it's easy to create any view you want in the data lake. Proponents make it seem like the data lake's water has magical properties that automatically build local and enterprise views. But diving into the details, you discover that the data lake depends on a comprehensive master data management (MDM) program. Before you can build views, you need to define and manage core metrics and dimensions, ideally in a consistent way across the enterprise. You then link together virtual tables using these entities to create the local or enterprise views you want.

The problem with this approach is that MDM is hard. It's hard for the same reason data warehousing is hard. Defining core entities is a business task that is fraught with politics. No one agrees how to define basic terms like "customer" or "sale". Therefore, the temptation is to simply build local solutions with local definitions. This meets the immediate need but does nothing to create a common view of the enterprise that executives need to run the business. An organization with lots of local views but no corporate view is like a chicken with its head cut off: it dashes madly here and there until it suddenly drops dead.

Of course, if your organization has invested in MDM, then building enterprise views in a data lake is easy. But the same is true of a data warehouse. When an MDM solution assumes the burden of reconciling business entities, then building a data warehouse is a swim in the lake, so to speak.
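To make that dependency concrete, here is a minimal sketch in Python. Everything in it is hypothetical and for illustration only: the two local views, the customer identifiers, the master_customers mapping, and the enterprise_view function are assumptions of mine, not part of any product. It simply shows that local views can only be rolled up into an enterprise view once someone has agreed on which local records refer to the same customer.

```python
from collections import defaultdict

# Hypothetical local views, each using its own customer identifiers.
sales_view = [
    {"customer": "ACME-001", "revenue": 1200.0},
    {"customer": "GLOBEX-7", "revenue": 300.0},
]
support_view = [
    {"customer": "Acme Corp.", "tickets": 4},
    {"customer": "Globex", "tickets": 1},
]

# Hypothetical master data: maps each (source, local id) pair
# to one agreed-upon enterprise customer key.
master_customers = {
    ("sales", "ACME-001"): "C1",
    ("support", "Acme Corp."): "C1",
    ("sales", "GLOBEX-7"): "C2",
    ("support", "Globex"): "C2",
}


def enterprise_view(views, master):
    """Combine local views into one row per master customer key."""
    combined = defaultdict(dict)
    for source, rows in views.items():
        for row in rows:
            key = master.get((source, row["customer"]))
            if key is None:
                # No agreed definition exists, so the row stays isolated.
                key = f"UNMAPPED:{source}:{row['customer']}"
            for field, value in row.items():
                if field != "customer":
                    combined[key][field] = combined[key].get(field, 0) + value
    return dict(combined)


views = {"sales": sales_view, "support": support_view}
with_mdm = enterprise_view(views, master_customers)
without_mdm = enterprise_view(views, {})

print(len(with_mdm), "customers with master data")
print(len(without_mdm), "customers without master data")
```

With the mapping in place, the four local rows collapse into two customers, each carrying both revenue and ticket counts. Without it, they remain four unrelated records. The code is the easy part; agreeing on the contents of the mapping is the hard, political work.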

Courting Chaos

Let's be honest: the data lake is geared to power users who want and need immediate access to all data as well as business units that want to build their own data-driven solutions quickly without corporate IT involvement. These are real needs and the data lake offers a wonderful way to address them.

But please don't believe that a data lake is going to easily give you enterprise views of your organization populated with clean, consistent, integrated data. Unless you have a full-fledged MDM environment and astute data architects, the data lake isn't going to digitally unify your organization. And that will disappoint the CEO and CFO. To make the data lake work for everyone requires a comprehensive data governance program, something that few organizations have implemented and even fewer have deployed successfully.

Ultimately, the data lake is a response to and antidote for the repressive data culture that exists in many companies. We've given too much power to the control freaks (i.e. IT architects) who feel the need to normalize, model, and secure every piece of data that comes into the organization. Even data warehousing professionals feel this way; they have developed and evangelized more agile, flexible approaches to deliver information to the masses.

Frankly, the data lake courts chaos. And that's fine. We need a measure of data chaos to keep the data nazis in check. The real problem with the data lake is that there are no warning signs to caution unsuspecting business people about the dangers lurking in its waters.

In subsequent posts, I'll debunk the criticisms of the data warehouse by the data lakers and present a new reference architecture (actually an ecosystem) that shows how to blend the data lake and more traditional approaches into a happy, harmonious whole.

Posted March 12, 2014 4:49 PM


Hi Wayne,
Excellent points! I agree. The problem with the idea of a data lake (or reservoir) is that every individual drop of water is indistinguishable from the next. In data terms, this means that context-setting becomes well-nigh impossible...

Ah Wayne - where have we heard this wonderful story before...? You don't need no stinkin' warehouse! Sorry, folks, Wayne is spot on. There is no silver bullet for data integration. If your BI and analytical results require fully integrated, cleaned up data, then you have to do the hard work to get there. A data lake becomes a data swamp without this hard work.

So right, Wayne, Barry and Claudia,

I have to fight some in-house belief that the 'data lake' is the cure-all for information misuse. This is despite the fact that we have a good EDW, but the marketing weight of Hadoop providers is so high!

I'm reminded of the early days of warehousing. It wasn't called data lake but the idea was there.
'Pour it in, we'll know what to do!'
Funny though, it didn't work.
We learnt the hard way that, unsurprisingly, the tough part is not creating the landfills; it is getting value out of them and preparing for that.
I like to go back to down-to-earth basics: When you observe landfills, you see people sorting out the trash. You can also transform garbage into methane, or heat or whatever if you know what you are looking for, or at least will know once you see it. Transformation, know-how, knowledge and knowledge about knowledge, insight: all that takes effort, talent and time.

I think we could get away with Data Governance (and all sub-domains) for data warehousing and BI, but I believe we need to up it to Information Governance (and all sub-domains + new competencies like ML) for big data, due to the records, documents, and various semi-structured and unstructured information snippets.

I guess, people will still believe in silver bullets, the shinier the better... Which is great if you are in the remedial section of consulting, but not satisfying if you are a builder and a creator.

Xavier, I feel your pain. Barry and Claudia, thanks for the endorsement. At least the data lake spawns wonderful analogies!

Excellent and insightful! Couldn't agree with you more, especially about the need for data governance, and also data discovery.

Mr Wayne. I agree with you 100%. Regards, BB
