
Wayne Eckerson

Welcome to Wayne's World, my blog that illuminates the latest thinking about how to deliver insights from business data and celebrates out-of-the-box thinkers and doers in the business intelligence (BI), performance management and data warehousing (DW) fields. Tune in here if you want to keep abreast of the latest trends, techniques, and technologies in this dynamic industry.

About the author

Wayne has been a thought leader in the business intelligence field since the early 1990s. He has conducted numerous research studies and is a noted speaker, blogger, and consultant. He is the author of two widely read books: Performance Dashboards: Measuring, Monitoring, and Managing Your Business (2005, 2010) and The Secrets of Analytical Leaders: Insights from Information Insiders (2012).

Wayne is founder and principal consultant at Eckerson Group, a research and consulting company focused on business intelligence, analytics and big data.

The hype and reality of the Big Data movement were on full display this week at Strata Conference in Santa Clara, California. With a sold-out show of 2,000+ attendees and 40+ sponsors, the conference was the epicenter of all things Hadoop and NoSQL--technologies that are increasingly gaining a foothold in corporate computing environments.

Most of the leading Hadoop distributions--Cloudera, Hortonworks, EMC Greenplum, and MapR--already count hundreds of customers. And it's clear that Big Data has moved from the province of Internet and media companies with large Web properties to nearly every industry. Strata speakers described compelling Big Data applications in energy, pharmaceuticals, utilities, financial services, insurance, and government.

For example, IBM has 200 customers using or testing its BigInsights Hadoop distribution, according to Anjul Bhambhri, vice president of Big Data at Big Blue. One IBM customer, Vestas Wind Systems, a leading wind turbine maker, uses BigInsights to model larger volumes of weather data so it can pinpoint the optimal placement of wind turbines. And a financial services customer uses BigInsights to improve the accuracy of its fraud models by addressing much larger volumes of transaction data.

Big Data Drivers

Hadoop clearly fills an unmet need in many organizations. Given its open source roots, Hadoop provides a more cost-effective way to analyze large volumes of data than traditional relational database management systems (RDBMS). It's also better suited to processing unstructured data, such as audio, video, or images, and semi-structured data, such as Web log data for tracking customer behavior on social media sites. For years, leading-edge companies have struggled, without much luck, to figure out an optimal way to analyze this type of data in traditional data warehousing environments. (See "Let the Revolution Begin: Big Data Liberation Theology.")

Finally, Hadoop is a load-and-go environment: administrators can dump the data into Hadoop without having to convert it into a particular structure. Then, users (or data scientists) can analyze the data using whatever tools they want, which today are typically languages, such as Java, Python, and Ruby. This type of data management paradigm appeals to application developers and analysts, who often feel straitjacketed by top-down, IT-driven architectures and SQL-based toolsets. (See "The New Analytical Ecosystem: Making Way for Big Data.")
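To make the load-and-go idea concrete, here is a minimal sketch in plain Python rather than Hadoop itself. The sample log lines, field names, and function names are all hypothetical. The point is that the raw lines are stored as they arrive, and structure is applied only when someone reads them:

```python
import json
from collections import Counter

# Hypothetical raw Web log lines, stored exactly as they arrived (no upfront schema).
RAW_LINES = [
    '{"user": "u1", "page": "/home", "ms": 120}',
    '{"user": "u2", "page": "/pricing"}',
    'not json at all',
    '{"user": "u1", "page": "/pricing", "referrer": "search"}',
]

def read_records(lines):
    """Apply structure at read time; skip lines that do not parse."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record

def page_views(lines):
    """Count views per page, tolerating records with missing fields."""
    return Counter(r["page"] for r in read_records(lines) if "page" in r)

views = page_views(RAW_LINES)
print(views)
```

Nothing had to be converted or rejected at load time; the malformed line and the records with extra or missing fields are simply handled by whoever does the analysis.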

Speed Bumps

But Hadoop is not a data management panacea. It's clearly at or near the apogee of its hype cycle right now, and its many warts will disillusion all but bleeding- and leading-edge adopters.

For starters, Hadoop is still very wet behind the ears. The Apache Foundation just released the equivalent of version 1.0. So there are plenty of basic things missing from the environment--like security, a metadata catalog, data quality, backups, and monitoring and control. Moreover, it's a batch processing environment that is not terribly efficient in the way it exploits a clustered environment. Hadoop knock-offs, like MapR, which embed proprietary technology underneath Hadoop APIs, claim up to five-fold faster performance on half as many nodes.

In addition, to actually run a Hadoop environment, you need to get software from a mishmash of Apache projects, with razzle-dazzle names like Flume, Sqoop, Oozie, Pig, Hive, and Zookeeper. These independent projects often contain competing functionality, have separate release schedules, and aren't always tightly integrated. And each project evolves rapidly. That's why there is a healthy market for Hadoop distributions that package these components into a reasonable set of implementable software.

But the biggest complaint among Big Data advocates is the current lack of data scientists to build Hadoop applications. These "wunderkinds" combine a rare set of skills: statistics and math, data, process and domain knowledge, and computer programming. Unfortunately, developers have little data and domain experience and data experts don't know how to program. So there is a severe shortage of talent. Many companies are hiring four people with relevant skills to create a virtual data scientist.


One good thing about the Big Data movement is that it evolves fast. There are Apache projects to address most of the shortcomings of Hadoop. One promising project is Hive, which provides SQL-like access to Hadoop, although it's stuck in a batch processing paradigm. Another is HBase, which overcomes Hadoop's latency issues, but is designed for fast row-based reads/writes to support high performance transactional applications. Both create table-like structures on top of Hadoop files.
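The difference between the two access patterns can be sketched in a few lines of plain Python. This is only an illustration: the records, the KeyedTable class, and the function names are hypothetical, and none of it uses the actual Hive or HBase APIs.

```python
# Hypothetical records; in a real cluster these would live in files on Hadoop.
RECORDS = [
    {"row_key": "cust-001", "region": "east", "spend": 40},
    {"row_key": "cust-002", "region": "west", "spend": 25},
    {"row_key": "cust-003", "region": "east", "spend": 10},
]

def batch_total_by_region(records):
    """Hive-style pattern: scan every record to answer one aggregate question."""
    totals = {}
    for rec in records:
        totals[rec["region"]] = totals.get(rec["region"], 0) + rec["spend"]
    return totals

class KeyedTable:
    """HBase-style pattern: read or write a single row by its key."""

    def __init__(self, records):
        # Copy each record so updates do not touch the source list.
        self._rows = {rec["row_key"]: dict(rec) for rec in records}

    def get(self, row_key):
        return self._rows.get(row_key)

    def put(self, row_key, **columns):
        self._rows.setdefault(row_key, {"row_key": row_key}).update(columns)

totals = batch_total_by_region(RECORDS)
table = KeyedTable(RECORDS)
table.put("cust-002", spend=30)
print(totals, table.get("cust-002"))
```

The batch function touches every record no matter how small the question, while the keyed table goes straight to one row. That is roughly why the first pattern suits reporting and the second suits transactional lookups.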

In addition, many commercial vendors have jumped into the fray, marrying proprietary technology with open source software to turn Hadoop into a more corporate-friendly compute environment. Vendors such as Zettaset, EMC Greenplum, and Oracle have launched appliances that embed Hadoop with commercial software to offer customers the best of both worlds. Many BI and data integration vendors now connect to Hadoop and can move data back and forth seamlessly. Some even create and run MapReduce jobs in Hadoop using their standard visual development environments.

Perhaps the biggest surprise at Strata was Microsoft's announcement that it plans to open source its Big Data software by donating it to the Apache Foundation. Microsoft has ported Hadoop to Windows Server and is working on an ODBC driver that works with Hive as well as a JavaScript framework for creating MapReduce jobs. These products will open Hadoop to millions of Microsoft developers. And of course, Oracle has already released a Hadoop appliance that embeds Cloudera's Hadoop distribution. If Microsoft and Oracle are on board, there's little that can stop the Big Data train.

Cooperation or Competition?

Although vendors are quick to jump on the Big Data bandwagon, there is some measure of desperation in the move. Established software vendors stand to lose significant revenue if Hadoop evolves without them and gains robust data management and analytical functionality that cannibalizes their existing products. They either need to generate sufficient revenue from new Big Data products or circumscribe Hadoop so that it plays a subservient role to their existing products. Most vendors are hedging their bets and playing both options, especially database vendors who perhaps have the most to lose.

In the spotlight of Strata Conference, both sides are playing nice and are eager to partner and work together. Hadoop vendors benefit as more applications run on Hadoop, including traditional BI, ETL, and DBMS products. And commercial vendors benefit if their existing tools have a new source of data to connect to and plumb. It's a big new market whose sweet-tasting honey attracts a hive full of bees.

Why Invest in Proprietary Tools?

But customers are already asking whether data warehouses and BI tools will eventually be folded into Hadoop environments or the reverse. Why spend millions of dollars on a new analytical RDBMS if you can do that processing without paying a dime in license costs using Hadoop? Why spend hundreds of thousands of dollars on data integration tools if your data scientists can turn Hadoop into a huge data staging and transformation layer? Why invest in traditional BI and reporting tools if your power users can exploit Hadoop using freely available languages and tools, such as Java, Python, Pig, Hive, or HBase?

The Future is Cloudy

Right now, it's too early to divine the future of the Big Data movement and predict winners and losers. It's possible that in the future all data management and analysis will run entirely on open source platforms and tools. But it's just as likely that commercial vendors will co-opt (or outright buy) open source products and functionality and use them as pipelines to magnify sales of their commercial products.

More than likely, we'll get a mélange of open source and commercial capabilities. After all, 30 years after the mainframe revolution, mainframes are still a mainstay at many corporations. In information technology, nothing ever dies; it just finds its niche in an evolutionary ecosystem.

Posted March 2, 2012 9:03 AM