Blog: Wayne Eckerson

Wayne Eckerson

Welcome to Wayne's World, my blog that illuminates the latest thinking about how to deliver insights from business data and celebrates out-of-the-box thinkers and doers in the business intelligence (BI), performance management, and data warehousing (DW) fields. Tune in here if you want to keep abreast of the latest trends, techniques, and technologies in this dynamic industry.

About the author

Wayne has been a thought leader in the business intelligence field since the early 1990s. He has conducted numerous research studies and is a noted speaker, blogger, and consultant. He is the author of two widely read books: Performance Dashboards: Measuring, Monitoring, and Managing Your Business (2005, 2010) and The Secrets of Analytical Leaders: Insights from Information Insiders (2012).

Wayne is currently director of BI Leadership Research, an education and research service run by TechTarget that provides objective, vendor-neutral content to business intelligence (BI) professionals worldwide. Wayne’s consulting company, BI Leader Consulting, provides strategic planning, architectural reviews, internal workshops, and long-term mentoring to both user and vendor organizations. For many years, Wayne served as director of education and research at The Data Warehousing Institute (TDWI), where he oversaw the company’s content and training programs and chaired its BI Executive Summit. He can be reached by email at

Hadoop advocates know they've struck gold. They've got new technology that promises to transform the way organizations capture, access, analyze, and act on information. (See Big Data Part II: "Hadoop 2 Changes Everything.") Market watchers estimate the potential revenue from big data software and systems to be in the tens of billions of dollars. So, it's not surprising that Hadoop advocates are eager to discard the old to make way for the new.

But in their haste, some Hadoop advocates have peddled a lot of misinformation about so-called "traditional" systems, especially the data warehouse. They seem to think that by bashing the data warehouse, they'll accelerate the pace at which people adopt Hadoop and the "data lake." (See "Big Data Part I: Beware of the Alligators in the Data Lake.") This is a counterproductive strategy for a couple of reasons.

Evolution, Not Revolution. First, the data warehouse will be an integral part of the analytical ecosystem for many years to come. It will take many years (decades?) for a majority of companies to convert their data and analytics architecture to a data lake powered by Hadoop, if they do at all. Organizations simply have too much time, money, and skill tied up in existing systems and applications to throw them away and start anew. The mantra of big data is evolution, not revolution. (To learn about these countervailing strategies, see "The Battle for the Future of Hadoop.")

Slippery Slope. Second, Hadoop is at the beginning of its journey, and while things look bright and rosy now, this new architecture will inevitably encounter dark times and failures, just like all new technologies. Thus, it's unwise for Hadoop advocates to take potshots at a mature technology, like the data warehouse, which has been refined in the crucible of thousands of real-world implementations. Just because there are data warehousing failures doesn't mean the technology is bankrupt or that a majority of organizations are eager to cast their data processing destiny to a new, untested platform whose deficiencies have yet to emerge.

Too Much to Bear. Many data warehousing deficiencies stem from the fact that the data warehouse has been asked to shoulder a bigger load than it was designed to handle. A data warehouse is best used to deliver answers to known questions: it allows users to monitor performance along predefined metrics and drill down and across related dimensions to gain additional context about a situation. It isn't optimized to support unfettered exploration and discovery or to store and provide access to non-relational data.
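The "known questions" workload described above can be illustrated with a toy star schema--one metric (sales) monitored along predefined dimensions, then drilled into for context. This is a minimal sketch using Python's built-in sqlite3 module; the table, columns, and figures are all made up.

```python
import sqlite3

# Toy star-schema fact table: one metric (amount) along two
# predefined dimensions (region, city). All figures are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, city TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("East", "Boston", 100.0), ("East", "New York", 250.0),
     ("West", "Denver", 80.0), ("West", "Seattle", 120.0)],
)

# Monitor a predefined metric along a predefined dimension: sales by region.
by_region = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))
print(by_region)  # {'East': 350.0, 'West': 200.0}

# Drill down into one region for additional context: East, by city.
east_by_city = dict(conn.execute(
    "SELECT city, SUM(amount) FROM sales WHERE region = 'East' GROUP BY city"))
print(east_by_city)  # {'Boston': 100.0, 'New York': 250.0}
```

Both queries follow paths the designer laid out in advance--which is exactly why this workload suits a data warehouse and open-ended discovery does not.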

But, since the data warehouse has been the only analytical game in town for the past 20 years, organizations have tried to shoehorn into it many workloads that it's not suited to handle. These failures aren't a blemish against the data warehouse as much as evidence of a lack of imagination about how best to solve various types of data processing problems. Fortunately, we now have other ways to capture, store, access, and analyze data. So, we can finally offload some of these workloads from our overburdened data warehouses and give them space to do what they do best--populate reports and dashboards with clean, integrated, and certified data.

A Process, Not a Technology. A final reason that Hadoop proponents shouldn't disparage the data warehouse is that the data warehouse is ultimately a process, not a technology. A data warehouse reunites an organization in electronic form (i.e. data) so that it can function as a single entity, not a conglomeration of loosely coupled fiefdoms. In this sense, the data warehouse will never go away.

The truth is that companies can implement a data warehouse with a variety of technologies and tools, including a data lake. Some are better than others, and none is sufficient in and of itself. But that is not the point: a data warehouse is really an abstraction, a logical representation of clean, vetted data that executives can use to make decisions. Without a data warehouse, executives run blind, making critical decisions with inaccurate data or no data at all.

So, despite what some critics say, the data warehouse is here to stay. It will remain a prominent fixture in analytical environments for many years to come.

Posted April 10, 2014 2:41 PM

Last year, we could write off Hadoop as a giant, low cost, data processing pump for refining multi-structured data and delivering it to the data warehouse. No more. Hadoop 2, and in particular the Apache Yarn project, changes everything.

Released in October 2013, Hadoop 2 turns the open source data management platform into a multipurpose operating system for big data. Rather than supporting just one type of data processing, Hadoop 2 supports any data processing application written to the YARN interface. As such, Hadoop 2 can support not only batch processing (i.e. MapReduce), but also real-time queries, search, in-memory computing, and whatever else anyone dreams up and writes to YARN.

The upshot is revolutionary: rather than move data to specialized applications and systems for processing, companies can process the data in Hadoop without moving it.

This message was trumpeted at last week's analyst day hosted by Cloudera, the first vendor to commercialize Hadoop services. In his opening remarks, Cloudera CEO Tom Reilly said that Hadoop 2 will change how companies architect analytic systems: "Rather than move data to compute resources, companies will move compute resources to data, saving enormous amounts of time and money."

The Data Lake

This new approach gives rise to the notion of a data lake, in which Hadoop not only stores all the data but processes it as well. (See "Beware of the Alligators in the Data Lake.") Cloudera is one of the first companies to commercialize the concept of a data lake, which it calls an Enterprise Data Hub (EDH). With an annual subscription, Cloudera EDH customers can access premium components (or data processing engines), including batch processing (MapReduce), analytic SQL (Impala), search (SOLR), machine learning (Spark), stream processing (Spark Streaming), and operational processing (HBase), with a raft of third-party applications on the way.

Converged Applications. The data lake spawns a new breed of "converged applications" that deliver enormous business value, according to Reilly. For instance, a company can use Spark Streaming to stream data from a sensor network into an in-memory engine (Spark), where it is analyzed and turned into a model that gets embedded in a high-volume Web application (HBase). All the while, the data never leaves Hadoop, which greatly simplifies data processing and reduces costs.

"EDH enables customers to build new types of applications that weren't feasible or cost-effective before," says Reilly.

Although many claim that Hadoop is not yet ready to support enterprise-caliber, production applications, Cloudera says demand for EDH is high. In fact, it sold eight subscriptions within six weeks at the end of the first quarter in which EDH was commercially available. Clearly, some leading-edge companies are jumping headfirst onto the Hadoop bandwagon, clearing the trail for everyone else.

Evolving into Hadoop

However, most companies are adopting Hadoop gradually, says Amr Awadallah, co-founder and CTO of Cloudera. Their initial motivation in adopting Hadoop is to improve operational efficiency. Either they want to reduce the cost of storing large volumes of data, accelerate ETL processes that are being squeezed by shrinking batch windows, or optimize a data warehouse by offloading ETL workloads or moving unused data to archival storage.

After organizations squeeze the cost-efficiencies from their data architectures, Awadallah says they implement Hadoop strategically to deliver greater business value. At first, they use Hadoop to give business analysts, data scientists, and lines of business quicker access to data so they can solve pressing business problems. Rather than wait for the IT department to move data from Hadoop into the data warehouse or other downstream systems, business users query data directly in Hadoop using SQL-like data access and analytics tools.

Once business users are comfortable accessing data directly in Hadoop, organizations typically consolidate Hadoop clusters into a data lake and implement YARN-compliant engines so they can build converged applications, as described above, that deliver outsized competitive advantage.

Stages of Grief. David McJannet, vice president of marketing at Hortonworks, Cloudera's closest rival, reinforces Awadallah's depiction of the Hadoop journey. He says most companies go through several "stages of grief," from denial to acceptance, when confronted with the fact that Hadoop storage is 30 to 50 times cheaper than traditional systems.

But rather than take a bold leap into the unknown with a startup company, McJannet says Hortonworks customers usually recruit a trusted partner from the commercial world to help them navigate the new terrain and blend the new world with the old. This evolutionary approach to implementing Hadoop is the centerpiece of Hortonworks' strategy. (See "The Battle for the Future of Hadoop").

McJannet also says that Hortonworks customers typically implement Hadoop to support new applications with multi-structured data, not to achieve operational efficiencies. "About 70% of our deals are for net new applications and 30% focus on data warehousing optimization," he says.


Whatever the starting or ending point, it's clear that Hadoop is shaking up the data management and analytics marketplace. In fact, during my rounds of Hadoop vendors last week, including Cloudera, Hortonworks, and MapR, all said they have experienced a rapid uptick in inquiries and deals in the past six to nine months.

If these claims are true, then customers are quickly moving beyond the "tire-kicking" stage and into production with Hadoop. If so, 2014 could be the year in which Hadoop goes mainstream--and this even before most customers have implemented Hadoop 2.

Posted March 26, 2014 12:55 PM

Say you have a ton of data in Hadoop and you want to explore it. But you don't want to move it into another system. (After all, it's big data, so why move it?) But you also don't want to go through the hassle and expense of creating table schemas in Hadoop to support fast queries. (After all, this is not supposed to be a data warehouse.) So what do you do?

You Hunk it. That is, you search it using Splunk software that creates virtual indexes in Hadoop. With Hunk, you don't have to move the data out of Hadoop and into an outboard analytical engine (including Splunk Enterprise). And you don't need to create table schemas in advance or at run time to guide (and limit) queries along predefined pathways. With Hunk, you point and go. It's search for Hadoop, but more scalable and manageable than open source search engines, such as SOLR, according to Splunk officials.

Hunk generates MapReduce jobs under the covers, so it's not an interactive query system. However, it does stream results as soon as a job starts, so an analyst can see whether the search criteria generate the desired results. If not, he can stop the search, change the criteria, and start again. So, it's as interactive as batch can get.
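That stop-and-refine interaction can be mimicked in a few lines of plain Python with a lazy generator--matches stream out as the scan proceeds, so the caller can peek at early results, abandon the scan, and rerun with tighter criteria. This is a sketch of the interaction pattern only, not Splunk's actual API; the log data is invented.

```python
def scan(records, predicate):
    """Lazily yield matching records while a full scan is still under way,
    so the caller sees early results before the job finishes."""
    for rec in records:
        if predicate(rec):
            yield rec

# Invented log data: 1,000 lines from three hosts, an error every 7th line.
logs = [f"host{i % 3} status={'500' if i % 7 == 0 else '200'}"
        for i in range(1000)]

# First attempt: peek at early results, decide the criteria are too broad.
results = scan(logs, lambda r: "host0" in r)
preview = [next(results) for _ in range(3)]
print(preview)  # first three matches, long before a full scan would finish

# Refine the criteria and start a new scan.
errors = list(scan(logs, lambda r: "host0" in r and "status=500" in r))
print(len(errors))  # 48
```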

Also, since Hunk is a Hadoop search engine, you cannot do basic things you can do with SQL, such as join tables, easily aggregate columns, or store data in a more compressed format. But it does let you search or explore data without specifying a schema or doing other advanced setup.

And unlike Splunk Enterprise, which runs only against log and sensor data, Splunk Hunk (gotta love that product name) can run against any data because it processes data using MapReduce. For instance, Hunk can search for videos with lots of red in them by invoking a MapReduce function that identifies color patterns in videos. You can also run queries that span indexes created in Splunk Enterprise and Hunk, making Hunk a federated query tool. And like Splunk Enterprise, Hunk supports 100+ analytical functions, making it more than just a Hadoop search tool.

So, if you're in the market for a bona fide exploration tool for Hadoop, try Hunk.

For more information, see

Posted March 17, 2014 7:22 PM

As silver bullets go, the "data lake" is a good one. Pitched by big data advocates, the data lake promises to speed the delivery of information and insights to the business community without the hassles imposed by IT-centric data warehousing processes. It almost seems too good to be true.

With a data lake, you simply dump all your data, both structured and unstructured, into the lake (i.e. Hadoop) and then let business people "distill" their own parochial views within it using whatever technology is best suited to the task (i.e. SQL or NoSQL, disk-based or in-memory databases, MPP or SMP). And you create enterprise views by compiling and aggregating data from multiple local views. The mantra of the data lake is think global, act local. Not bad!
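The schema-on-read idea behind "distilling" views can be sketched in plain Python: raw records of mixed shapes land in the lake untouched, and a local view parses only what its consumers care about. The record formats and field names here are invented for illustration.

```python
import json

# The "lake": raw records of mixed shapes, landed as-is with no upfront modeling.
lake = [
    json.dumps({"type": "order", "cust": "a17", "total": 42.5}),
    "2014-03-12T10:01:00 GET /pricing 200",   # a raw web-log line
    json.dumps({"type": "order", "cust": "b23", "total": 19.0}),
    "2014-03-12T10:02:00 GET /signup 404",
]

def order_view(raw_records):
    """A 'distilled' local view: parse only the records this consumer needs."""
    for rec in raw_records:
        try:
            parsed = json.loads(rec)
        except ValueError:
            continue  # not JSON: some other group's data, leave it in the lake
        if parsed.get("type") == "order":
            yield parsed

revenue = sum(o["total"] for o in order_view(lake))
print(revenue)  # 61.5
```

Note what is missing: nothing here guarantees that two groups distilling "orders" will agree on what an order is--which is exactly the governance gap discussed below.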

Data Lake Benefits. Assuming this approach works, there are many benefits. First, the data lake gives business users immediate access to all data. They don't have to wait for the data warehousing (DW) team to model the data or give them access. Rather, they shape the data however they want to meet local requirements. The data lake speeds delivery and offers unparalleled flexibility, since nothing--and nobody--stands between business users and the data.

Second, data in the lake is not limited to relational or transactional data--the traditional fare served by data warehouses. The data lake can contain any type of data: clickstream, machine-generated, social media, and external data, and even audio, video, and text. It's a proverbial cornucopia of data delights for the data digerati.

Third, with a data lake, you never need to move the data. And that's important in the era of big data. The data streams into the lake and stays there. You process it in place using whatever technology you want and serve it up however users want. But the data never leaves the lake. It's one big body of water with many different fishing spots, one for every type of sportsman.

So, there is a lot to like about the data lake: it empowers business users, liberating them from the bonds of IT domination; it speeds delivery, enabling business units to stand up applications quickly; and it ushers in new types of data and technology that lower the costs of data processing while improving performance. So what's the problem?

Alligators in the Swamp

Uncharted territory. Although big data advocates are quite adept at promoting their stuff (and even better at bashing the data warehouse), they never tell you about the alligators in the swamp. Since very few companies have actually implemented a data lake, perhaps no one has seen the creatures yet. But they are there. In fact, the first razor-toothed reptile that should cause your adrenalin to surge is the fact that the data lake is uncharted water. This stuff is so new that only real risk-takers are willing to swim in the swamp.

Expensive. The risk, however, presents a great sales opportunity. Product and services vendors are more than willing to help you reap the benefits of the data lake, while minimizing the risk. Don't have Hadoop, MPP, in-memory engines, or SQL-on-Hadoop tools or any experience managing them? No problem, we can sell and implement those technologies for you. Don't know how to distill local and enterprise views from the lake? No worries, our consultants can help you architect, design, and even manage the lake for you. All you need to take a swim is money, and lots of it! That's the second danger: the threat to your budget.

Data governance. The biggest peril, however, is the subtle message that it's easy to create any view you want in the data lake. Proponents make it seem like the data lake's water has magical properties that automatically build local and enterprise views. But dive into the details and you discover that the data lake depends on a comprehensive master data management (MDM) program. Before you can build views, you need to define and manage core metrics and dimensions, ideally in a consistent way across the enterprise. You then link together virtual tables using these entities to create the local or enterprise views you want.
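A minimal sketch of that linking step, using Python's built-in sqlite3: two local views keep their own parochial customer keys, and a master customer table reconciles them so an enterprise view can be rolled up. All table names and data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two local views, each with its own parochial customer key.
conn.execute("CREATE TABLE web_orders (web_cust TEXT, amount REAL)")
conn.execute("CREATE TABLE store_orders (store_cust TEXT, amount REAL)")
conn.executemany("INSERT INTO web_orders VALUES (?, ?)",
                 [("w1", 100.0), ("w2", 50.0)])
conn.executemany("INSERT INTO store_orders VALUES (?, ?)",
                 [("s9", 75.0), ("s9", 25.0)])

# The MDM piece: one master record per customer that links the local keys.
conn.execute(
    "CREATE TABLE master_customer (cust_id TEXT, web_cust TEXT, store_cust TEXT)")
conn.executemany("INSERT INTO master_customer VALUES (?, ?, ?)",
                 [("C-1", "w1", "s9"), ("C-2", "w2", None)])

# Enterprise view: total sales per conformed customer across both channels.
enterprise = dict(conn.execute("""
    SELECT m.cust_id,
           COALESCE((SELECT SUM(amount) FROM web_orders w
                     WHERE w.web_cust = m.web_cust), 0) +
           COALESCE((SELECT SUM(amount) FROM store_orders s
                     WHERE s.store_cust = m.store_cust), 0)
    FROM master_customer m
"""))
print(enterprise)  # {'C-1': 200.0, 'C-2': 50.0}
```

The SQL is trivial; the hard part is the master_customer table itself--agreeing on who "C-1" is across business units is the political work described next.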

The problem with this approach is that MDM is hard. It's hard for the same reason data warehousing is hard. Defining core entities is a business task that is fraught with politics. No one agrees how to define basic terms like "customer" or "sale". Therefore, the temptation is to simply build local solutions with local definitions. This meets the immediate need but does nothing to create a common view of the enterprise that executives need to run the business. An organization with lots of local views but no corporate view is like a chicken with its head cut off: it dashes madly here and there until it suddenly drops dead.

Of course, if your organization has invested in MDM, then building enterprise views in a data lake is easy. But the same is true of a data warehouse. When an MDM solution assumes the burden of reconciling business entities, then building a data warehouse is a swim in the lake, so to speak.

Courting Chaos

Let's be honest: the data lake is geared to power users who want and need immediate access to all data as well as business units that want to build their own data-driven solutions quickly without corporate IT involvement. These are real needs and the data lake offers a wonderful way to address them.

But please don't believe that a data lake is going to easily give you enterprise views of your organization populated with clean, consistent, integrated data. Unless you have a full-fledged MDM environment and astute data architects, the data lake isn't going to digitally unify your organization. And that will disappoint the CEO and CFO. To make the data lake work for everyone requires a comprehensive data governance program, something that few organizations have implemented and even fewer have deployed successfully.

Ultimately, the data lake is a response to and antidote for the repressive data culture that exists in many companies. We've given too much power to the control freaks (i.e. IT architects) who feel the need to normalize, model, and secure every piece of data that comes into the organization. Even many data warehousing professionals recognize this; they have developed and evangelized more agile, flexible approaches to delivering information to the masses.

Frankly, the data lake courts chaos. And that's fine. We need a measure of data chaos to keep the data nazis in check. The real problem with the data lake is that there are no warning signs to caution unsuspecting business people about the dangers lurking in its waters.

In subsequent posts, I'll debunk the criticisms of the data warehouse by the data lakers and present a new reference architecture (actually an ecosystem) that shows how to blend the data lake and more traditional approaches into a happy, harmonious whole.

Posted March 12, 2014 4:49 PM

The cloud eliminates the need to buy, install, and manage hardware and software, significantly reducing the cost of implementing BI solutions while speeding delivery times.

One new company hoping to cash in on the movement to run BI in the cloud is RedRock BI, which offers a complete BI stack in the cloud starting at $2,500 a month for up to 2TB of data. The service runs on Amazon EC2, leverages Amazon RedShift, and comes with a single-premise cloud upload utility, 120 hours of Syncsort's ETL service, a five-user license to the Yellowfin BI tools, and five hours of RedRock BI support.

This makes RedRock BI an order of magnitude cheaper than any other full-stack BI solution on the market, according to Doug Slemmer, who runs RedRock BI. And customers can expand their implementations inexpensively, he says. An additional 2TB of data costs $650 a month, 120 hours of Syncsort ETL costs $750 a month, and additional Yellowfin users go for $70 a month each.
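Using the list prices quoted above, a back-of-the-envelope monthly cost is easy to compute. This toy calculator assumes capacity is bought in the quoted blocks (2TB of data, 120 hours of ETL) and rounds partial blocks up--both assumptions of mine, not RedRock BI's published terms.

```python
# List prices quoted in the post, per month.
BASE = 2500        # includes 2TB of data, 120 hours of ETL, 5 Yellowfin users
EXTRA_2TB = 650
EXTRA_120H_ETL = 750
EXTRA_USER = 70

def monthly_cost(tb=2, etl_hours=120, users=5):
    """Back-of-the-envelope monthly cost; partial blocks round up (an assumption)."""
    extra_tb_blocks = max(0, -(-(tb - 2) // 2))            # ceiling division
    extra_etl_blocks = max(0, -(-(etl_hours - 120) // 120))
    extra_users = max(0, users - 5)
    return (BASE + extra_tb_blocks * EXTRA_2TB
            + extra_etl_blocks * EXTRA_120H_ETL
            + extra_users * EXTRA_USER)

print(monthly_cost())               # 2500 -- the base package
print(monthly_cost(tb=4, users=8))  # 3360 -- one extra 2TB block, three extra users
```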

Slemmer doesn't expect RedRock BI's current pricing advantage to continue indefinitely. He expects other firms will soon combine off-the-shelf BI services and tools to create affordable cloud-based BI packages for the mid-market and departments at larger companies. As a result, Slemmer said he hopes to capitalize on RedRock BI's first-mover advantage by aggressively promoting its services.

RedRock BI released its cloud service for general availability on February 28. It has one paying customer, Dickey's Barbecue Pit, and five more prospects are conducting proofs of concept. For more information, go to

Posted March 4, 2014 5:21 PM
