We use cookies and other similar technologies (Cookies) to enhance your experience and to provide you with relevant content and ads. By using our website, you are agreeing to the use of Cookies. You can change your settings at any time. Cookie Policy.

Blog: Wayne Eckerson Subscribe to this blog's RSS feed!

Wayne Eckerson

Welcome to Wayne's World, my blog that illuminates the latest thinking about how to deliver insights from business data and celebrates out-of-the-box thinkers and doers in the business intelligence (BI), performance management and data warehousing (DW) fields. Tune in here if you want to keep abreast of the latest trends, techniques, and technologies in this dynamic industry.

About the author >

Wayne has been a thought leader in the business intelligence field since the early 1990s. He has conducted numerous research studies and is a noted speaker, blogger, and consultant. He is the author of two widely read books: Performance Dashboards: Measuring, Monitoring, and Managing Your Business (2005, 2010) and The Secrets of Analytical Leaders: Insights from Information Insiders (2012).

Wayne is founder and principal consultant at Eckerson Group,a research and consulting company focused on business intelligence, analytics and big data.

March 2014 Archives

Say you have a ton of data in Hadoop and you want to explore it. But you don't want to move it into another system. (After all, it's big data so why move it?) But you don't want to go through the hassle and expense of creating table schemas in Hadoop to support fast queries. (After all, this is not supposed to be a data warehouse.) So what do you do??

You Hunk it. That is, you search it using Splunk software that creates virtual indexes in Hadoop. With Hunk, you don't have to move the data out of Hadoop and into an outboard analytical engine (including Splunk Enterprise). And you don't need to create table schemas in advance or at run time to guide (and limit) queries along predefined pathways. With Hunk, you point and go. It's search for Hadoop, but more scalable and manageable than open source search engines, such as SOLR, according to Splunk officials.

Hunk generates MapReduce under the covers, so it's not an interactive query system. However, it does stream results immediately once the job starts, so an analyst can see whether his search criteria generates the desired results. If not, he can stop the search, change the criteria, and start again. So, it's as interactive as batch can get.

Also, since Hunk is a Hadoop search engine, you cannot do basic things you can do with SQL, such as join tables or add up columns easily or store data in a more compressed format. But it does let you search or explore data without specifying schema or other advanced setup.

And unlike Splunk Enterprise which only runs against log and sensor data, Splunk Hunk (gotta love that product name) can run against any data because it processes data using MapReduce. For instance, Hunk can search for videos with lots of red in them by invoking a a MapReduce function that identifies color patterns in videos. You can also run queries that span indexes created in Splunk Enterprise and Hunk, making Hunk a federated query tool. And like Splunk Enterprise, Hunk supports 100+ analytical functions, making it more than just a Hadoop search tool.

So, if you're in the market for a bonafide exploration tool for Hadoop, try Hunk.

For more information, see www.splunk.com.

Posted March 17, 2014 7:22 PM
Permalink | No Comments |

As silver bullets go, the "data lake" is a good one. Pitched by big data advocates, the data lake promises to speed the delivery of information and insights to the business community without the hassles imposed by IT-centric data warehousing processes. It almost seems too good to be true.

With a data lake, you simply dump all your data, both structured and unstructured, into the lake (i.e. Hadoop) and then let business people "distill" their own parochial views within it using whatever technology is best suited to the task (i.e. SQL or NoSQL, disk-based or in-memory databases, MPP or SMP.) And you create enterprise views by compiling and aggregating data from multiple local views. The mantra of the data lake is think global, act local. Not bad!

Data Lake Benefits. Assuming this approach works, there are many benefits. First, the data lake gives business users immediate access to all data. They don't have to wait for the data warehousing (DW) team to model the data or give them access. Rather, they shape the data however they want to meet local requirements. The data lake speeds delivery and offers unparalleled flexibility since nobody or no thing stands between business users and the data.

Second, data in the lake is not limited to relational or transactional data--the traditional fare served by data warehouses. The data lake can contain any type of data: clickstream, machine-generated, social media, and external data, and even audio, video, and text. It's a proverbial cornucopia of data delights for the data digerati.

Third, with a data lake, you never need to move the data. And that's important in the era of big data. The data streams into the lake and stays there. You process it in place using whatever technology you want and serve it up however users want. But the data never leaves the lake. It's one big body of water with many different fishing spots, one for every type of sportsman.

So, there is a lot to like about the data lake: it empowers business users, liberating them from the bonds of IT domination; it speeds delivery, enabling business units to stand up applications quickly; and it ushers in new types of data and technology that lower the costs of data processing while improving performance. So what's the problem?

Alligators in the Swamp

Uncharted territory. Although big data advocates are quite adept at promoting their stuff (and even better at bashing the data warehouse), they never tell you about the alligators in the swamp. Since very few companies have actually implemented a data lake, perhaps no one has seen the creatures yet. But they are there. In fact, the first razor-toothed amphibian that should cause your adrenalin to surge is the fact that the data lake is uncharted water. This stuff is so new, that only real risk-takers are willing to swim in the swamp.

Expensive. The risk, however, presents a great sales opportunity. Product and services vendors are more than willing to help you reap the benefits of the data lake, while minimizing the risk. Don't have Hadoop, MPP, in-memory engines, or SQL-on-Hadoop tools or any experience managing them? No problem, we can sell and implement those technologies for you. Don't know how to distill local and enterprise views from the lake? No worries, our consultants can help you architect, design, and even manage the lake for you. All you need to take a swim is money, and lots of it! That's the second danger: the threat to your budget.

Data governance. The biggest peril, however, is the subtle message that it's easy to create any view you want in the data lake. Proponents make it seem like the data lake's water has magical properties that automatically build local and enterprise views. But diving into the details, you discover that the data lake depends on comprehensive master data management (MDM) program. Before you can build views, you need to define and manage core metrics and dimensions, ideally in a consistent way across the enterprise. You then link together virtual tables using these entities to create the local or enterprise views you want.

The problem with this approach is that MDM is hard. It's hard for the same reason data warehousing is hard. Defining core entities is a business task that is fraught with politics. No one agrees how to define basic terms like "customer" or "sale". Therefore, the temptation is to simply build local solutions with local definitions. This meets the immediate need but does nothing to create a common view of the enterprise that executives need to run the business. An organization with lots of local views but no corporate view is like a chicken with its head cut off: it dashes madly here and there until it suddenly drops dead.

Of course, if your organization has invested in MDM, then building enterprise views in a data lake is easy. But the same is true of a data warehouse. When an MDM solution assumes the burden of reconciling business entities, then building a data warehouse is a swim in the lake, so to speak.

Courting Chaos

Let's be honest: the data lake is geared to power users who want and need immediate access to all data as well as business units that want to build their own data-driven solutions quickly without corporate IT involvement. These are real needs and the data lake offers a wonderful way to address them.

But please don't believe that a data lake is going to easily give you enterprise views of your organization populated with clean, consistent, integrated data. Unless you have a full-fledged MDM environment and astute data architects, the data lake isn't going digitally unify your organization. And that will disappoint the CEO and CFO. To make the data lake work for everyone requires a comprehensive data governance program, something that few organizations have implemented and even fewer have deployed successfully.

Ultimately, the data lake is a response to and antidote for the repressive data culture that exists in many companies. We've given too much power to the control freaks (i.e. IT architects) who feel the need to normalize, model, and secure every piece of data that comes into the organization. Even data warehousing professionals feel this way; they have developed and evangelized more agile, flexible approaches to deliver information to the masses.

Frankly, the data lake courts chaos. And that's fine. We need a measure of data chaos to keep the data nazis in check. The real problem with the data lake is that there are no warning signs to caution unsuspecting business people about the dangers lurking in its waters.

In subsequent posts, I'll debunk the criticisms of the data warehouse by the data lakers and present a new reference architecture (actually an ecosystem) that shows how to blend the data lake and more traditional approaches into a happy, harmonious whole.

Posted March 12, 2014 4:49 PM
Permalink | 6 Comments |

The cloud eliminates the need to buy, install, and manage hardware and software, significantly reducing the cost of implementing BI solutions while speeding delivery times.

One new company hoping to cash in on the movement to run BI in the cloud is RedRock BI, which offers a complete BI stack in the cloud starting at $2,500 a month for up to 2TB of data. The service runs on Amazon EC2, leverages Amazon RedShift, and comes with a single-premise cloud upload utility, 120 hours of Syncsort's ETL service, a five-user license to the Yellowfin BI tools, and five hours of RedRock BI support.

This makes RedRock BI an order of magnitude cheaper than any other full-stack BI solution on the market, according to Doug Slemmer, who runs RedRock BI. And customers can expand their implementations inexpensively, he says. An additional 2TB of data costs $650 a month, 120 hours of Syncsort ETL costs $750 a month, and additional Yellowfin users go for $70 a month each.

Slemmer doesn't expect RedRock BI's current pricing advantage to continue indefinitely. He expects other firms will soon combine off-the-shelf BI services and tools to create affordable cloud-based BI packages for the mid-market and departments at larger companies. As a result, Slemmer said he hopes to capitalize on RedRock BI's first-mover advantage by aggressively promoting its services.

RedRock BI released its cloud service for general availability on February 28, and has one paying customer, Dickey's Barbecue Pit. Several more are conducting five proofs of concept. For more information, go to www.redrockbi.com.

Posted March 4, 2014 5:21 PM
Permalink | No Comments |