

Blog: Wayne Eckerson

Wayne Eckerson

Welcome to Wayne's World, my blog that illuminates the latest thinking about how to deliver insights from business data and celebrates out-of-the-box thinkers and doers in the business intelligence (BI), performance management and data warehousing (DW) fields. Tune in here if you want to keep abreast of the latest trends, techniques, and technologies in this dynamic industry.

About the author

Wayne has been a thought leader in the business intelligence field since the early 1990s. He has conducted numerous research studies and is a noted speaker, blogger, and consultant. He is the author of two widely read books: Performance Dashboards: Measuring, Monitoring, and Managing Your Business (2005, 2010) and The Secrets of Analytical Leaders: Insights from Information Insiders (2012).

Wayne is founder and principal consultant at Eckerson Group, a research and consulting company focused on business intelligence, analytics, and big data.

Hadoop was designed as a batch processing environment. But oddly, most people view Hadoop as an exploratory environment in which data scientists (i.e., quintessential power users) mine mountains of data and find valuable insights. Many companies are eager to unleash their data scientists on Web logs and Twitter feeds to better understand customer shopping behavior and sentiment, among other things.

The reality today is that Hadoop is too slow to support iterative analysis. Not only does it run in batch, it isn't even a terribly efficient batch environment. Hadoop version 1.0, at least, has no concept of joins, so programmers must string together hundreds of MapReduce jobs to execute relatively simple queries. There is minimal workload management, so a single query can consume all the resources of a cluster or partition. (Hadoop 2.0 addresses these and other deficiencies.)
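To see why the lack of a join operator matters, consider how a single join has to be expressed in the MapReduce model. The sketch below is plain Python with hypothetical data (no Hadoop required): a "reduce-side join" tags records from each input with their source in the map phase, then pairs them up by key in the reduce phase. A query joining several tables must chain one such job per join, which is where the long strings of MapReduce jobs come from.

```python
# Minimal sketch of a reduce-side join in the MapReduce style.
# Data and table names ("users", "clicks") are hypothetical.
from collections import defaultdict

def map_phase(records, key_index, tag):
    # Mapper: emit (join_key, (source_tag, record)) pairs.
    for rec in records:
        yield rec[key_index], (tag, rec)

def reduce_phase(mapped):
    # Reducer: group tagged records by join key, then cross the two sides.
    groups = defaultdict(list)
    for key, tagged in mapped:
        groups[key].append(tagged)
    for key, tagged_list in groups.items():
        left = [r for t, r in tagged_list if t == "users"]
        right = [r for t, r in tagged_list if t == "clicks"]
        for l in left:
            for r in right:
                yield key, l, r

users = [("u1", "Alice"), ("u2", "Bob")]
clicks = [("u1", "/home"), ("u1", "/cart"), ("u3", "/home")]

mapped = list(map_phase(users, 0, "users")) + list(map_phase(clicks, 0, "clicks"))
joined = list(reduce_phase(mapped))
# Only user u1 appears in both inputs, so the join yields two rows.
```

In a relational database the optimizer plans and executes this in one statement; in MapReduce 1.0 every join is another full map-and-reduce pass over the data.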

Consequently, most companies today use Hadoop as a gigantic extraction and transformation engine that captures, stores, and processes semi-structured data and pushes the results into SQL-based environments where business users query and analyze the data using familiar SQL-based tools. Interestingly, few companies use Hadoop for other batch-oriented analytical workloads, such as scheduled reporting and data mining. The hype almost exclusively emphasizes exploration and discovery.
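The extraction-and-transformation pattern described above usually boils down to code like the following sketch (plain Python, with a hypothetical Apache-style log format): parse semi-structured Web log lines into flat rows, discard the malformed ones, and hand the result to a bulk loader for a SQL warehouse.

```python
# Minimal sketch of the "Hadoop as ETL engine" pattern: raw log text in,
# flat warehouse-ready rows out. The log format shown is an assumption.
import re

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) - - \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+)'
)

def transform(line):
    """Turn one raw log line into a structured row, or None if malformed."""
    m = LOG_PATTERN.match(line)
    if not m:
        return None  # ETL jobs typically drop or quarantine bad records
    return (m.group("ip"), m.group("ts"), m.group("method"),
            m.group("path"), int(m.group("status")))

raw = '203.0.113.9 - - [16/Apr/2013:13:57:00 +0000] "GET /products/42 HTTP/1.1" 200'
row = transform(raw)
# row is a flat tuple ready to be bulk-loaded into a warehouse table
```

At scale, the same transform function runs as the map step over terabytes of logs, and a connector (Sqoop, for example) moves the output rows into the relational environment where business users query them with familiar SQL tools.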

Real-time or Bust!

So, Hadoop's analytical mystique belies the facts. This is the primary reason that leading Hadoop vendors have rushed this year to unveil real-time query engines that run inside Hadoop. Cloudera launched Impala, EMC Greenplum announced Hawq, and Hortonworks is betting on Hive. All three not only claim to turn Hadoop into a bona fide, iterative analytical environment; they also support SQL or SQL-like interfaces to make it easier for non-data scientists to access Hadoop data. Of course, these environments are brand new and relatively untested. The jury is still out on whether they can truly reinvent Hadoop in the real-time image its ardent supporters envision.

The alternative is to move the data into SQL-based analytical engines, such as Teradata or IBM Netezza, that are designed to run complex queries and analytical functions against terabytes of data. Every database vendor now offers connectors to move data from Hadoop into their proprietary environments. But these systems come with a steep price tag and require IT administrators to move large volumes of data across thin pipes, not a smart thing to do if we're truly talking about "big data."

Consequently, early adopters of Hadoop have asked vendors to deliver a real-time query engine, and vendors have heeded those calls. If Hadoop truly supports real-time queries, it could reinvent the entire analytical landscape and make investors in Hadoop startups fabulously wealthy. But don't rush to call your broker: Hadoop faces a longer road to real-time nirvana than the relational database management system (RDBMS), which has been on that path for more than 30 years.

The cynic in me says that the Hadoop community is now trying to recreate the RDBMS on an open source platform. This could take a while. But maybe it's the journey, not the destination, that really counts. We've learned a lot about ourselves by examining and tinkering with this new alternative data processing platform. Hadoop clarifies the strengths and weaknesses of our SQL-based world, putting them in sharp relief. So, I welcome Hadoop and believe it will play an increasingly critical role in most BI environments.


Posted April 16, 2013 1:57 PM
