Editor's note: This is part III in a multi-part series on Big Data. To view part II "Big Data Liberation Theology", click here.
There are two types of Big Data in the market today. There is open source software, centered largely around Hadoop, which eliminates upfront licensing costs for managing and processing large volumes of data. And then there are new analytical engines, including appliances and column stores, which deliver significantly better price-performance than the general purpose relational databases that have dominated the market for three decades. Both sets of Big Data software deliver higher returns on investment than previous generations of data management technology, but in vastly different ways.
Free Software. Hadoop is open source software, available through the Apache Software Foundation, that combines a distributed file system with a parallel processing engine to store and process large volumes of data across a grid of commodity servers. Hadoop emanated from large internet companies that needed a cost-effective way to build search indexes. They knew that traditional relational databases would be prohibitively expensive and technically unwieldy, so they built a low-cost alternative themselves: Google published papers describing its approach, engineers at Yahoo led the open source implementation, and the project was eventually contributed to the Apache Software Foundation so others could benefit from their innovations.
Today, many companies are implementing Hadoop software from Apache as well as third party providers, such as Cloudera, Hortonworks, EMC, and IBM. Developers see Hadoop as a cost-effective way to get their arms around large volumes of data that they've never been able to do much with before. For the most part, companies use Hadoop to store, process, and analyze large volumes of Web log data so they can get a better feel for the browsing and shopping behavior of their customers. Previously, most companies outsourced the analysis of their clickstream data or simply let it "fall on the floor" since they didn't have a way to process it in a timely and cost-effective way.
Data Agnostic. Besides being free, the other major advantage of Hadoop software is that it's data agnostic. It can handle any type of data. Unlike a data warehouse or traditional relational database, Hadoop doesn't require administrators to model or transform data before they load it. With Hadoop, you don't define a structure for the data; you simply load and go. This significantly reduces the cost of preparing data for analysis compared to what happens in a data warehouse. Most experts assert that 60% to 80% of the cost of building a data warehouse, which can run into the tens of millions of dollars, involves extracting, transforming, and loading (ETL) data. Hadoop virtually eliminates this upfront cost.
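To make "load and go" concrete, here is a minimal sketch of how it might look in practice (the file paths and field positions are illustrative assumptions, not a reference implementation): raw Web logs are copied into Hadoop exactly as they arrive, and structure is imposed only when a job reads them.

```python
#!/usr/bin/env python
# mapper.py -- a minimal Hadoop Streaming mapper, for illustration only.
# The raw logs are loaded into HDFS as-is, with no modeling or ETL up front, e.g.:
#   hadoop fs -put access_log.2012-02-06 /raw/weblogs/
# The structure assumed below (space-delimited fields, requested URL in the
# seventh position) is applied only here, at read time.
import sys

for line in sys.stdin:
    fields = line.split()
    if len(fields) < 7:
        continue                      # ignore lines that don't fit the assumed layout
    url = fields[6]
    print("%s\t1" % url)              # emit (url, 1); a reducer would sum these into page views
```

A script like this would be paired with a reducer that sums the counts and run through the Hadoop Streaming utility, but the point is what's missing: no data modeling, no transformation, no load job ahead of time.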
As a result, many companies are starting to use Hadoop as a general purpose staging area and archive for all their data. A telecommunications company, for example, can store 12 months of call detail records instead of aggregating that data in the data warehouse and rolling the details off to offline storage. With Hadoop, it can keep all of its data online and eliminate the cost of data archival systems. It can also let power users query Hadoop data directly when they want the raw data or can't wait for the aggregates to be loaded into the data warehouse.
Hidden Costs. Of course, nothing in technology is ever free. When it comes to processing data, you either "pay the piper" upfront, as in the data warehousing world, or at query time, as in the Hadoop world. Before querying Hadoop data, a developer needs to understand the structure of the data and all of its anomalies. With a clean, well-understood, homogeneous data set, this is not difficult. But most corporate data doesn't fit this description. So a Hadoop developer ends up playing the role of a data warehousing developer at query time, interrogating the data and making sure its format and content match expectations. Querying Hadoop today is a "buyer beware" environment.
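To see where the deferred cost shows up, consider a sketch of the same kind of job once the data turns out to be messy (the five-column, tab-delimited layout here is purely an assumption for illustration): the developer now has to verify, record by record, that format and content match expectations, and account for the records that don't.

```python
#!/usr/bin/env python
# A query-time mapper doing the checking a data warehouse ETL process would
# normally have done before loading -- illustrative only; the five-column,
# tab-delimited layout is an assumption.
import sys

def parse(line):
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        return None                    # structural anomaly: wrong number of columns
    try:
        amount = float(fields[3])      # content anomaly: non-numeric amount
    except ValueError:
        return None
    return fields[0], amount

for line in sys.stdin:
    record = parse(line)
    if record is None:
        # Hadoop Streaming convention: writing this line to stderr increments a
        # job counter, so malformed records are at least visible to the developer.
        sys.stderr.write("reporter:counter:DataQuality,MalformedRecords,1\n")
        continue
    customer_id, amount = record
    print("%s\t%s" % (customer_id, amount))
```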
Moreover, to run Big Data software, you still need to purchase, install, and manage commodity servers (unless you run your Big Data environment in the Cloud, say through Amazon Web Services). While each server may not cost a lot, collectively the price adds up.
But what's more costly is the expertise and software required to administer Hadoop and manage grids of commodity servers. Hadoop is still bleeding-edge technology, and few people have the skills or experience to run it efficiently in a production environment. These folks are hard to find, and they don't come cheap. Members of the Apache Software Foundation admit that Hadoop's latest release is equivalent to version 1.0 software, so even the experts have a lot to learn as the technology evolves at a rapid pace. Nonetheless, Hadoop and its NoSQL brethren have opened up a vast new frontier for organizations to profit from their data.
The other type of Big Data predates Hadoop and the NoSQL variants by several years. This version of Big Data is less a "movement" than an extension of existing relational database technology optimized for query processing. These analytical platforms span a range of technology, from appliances and columnar databases to shared-nothing, massively parallel processing (MPP) databases. The common thread among them is that most are read-only environments that deliver exceptional price-performance compared to general purpose relational databases originally designed to run transaction processing applications.
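A toy example illustrates why columnar storage, in particular, pays off for this kind of workload (this sketches the general idea only, not any vendor's implementation): an analytical query that aggregates a single column only has to scan that column, whereas a row-oriented database built for transactions reads every field of every record it touches.

```python
# Row-oriented layout: all fields of a record are stored together.
rows = [
    {"order_id": 1, "customer": "acme",  "region": "east", "revenue": 120.0},
    {"order_id": 2, "customer": "bravo", "region": "west", "revenue": 75.5},
    {"order_id": 3, "customer": "acme",  "region": "east", "revenue": 200.0},
]

# Column-oriented layout: each column is stored (and typically compressed) separately.
columns = {
    "order_id": [1, 2, 3],
    "customer": ["acme", "bravo", "acme"],
    "region":   ["east", "west", "east"],
    "revenue":  [120.0, 75.5, 200.0],
}

# SELECT SUM(revenue): the row layout touches every field of every record,
# while the column layout reads a single list -- a fraction of the I/O on wide tables.
total_from_rows    = sum(r["revenue"] for r in rows)
total_from_columns = sum(columns["revenue"])
assert total_from_rows == total_from_columns
```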
Teradata laid the groundwork for the analytical platform market when it launched the first analytical appliance in the early 1980s. Sybase was also an early forerunner, shipping the first columnar database in the mid 1990s. Netezza kicked the current market into high gear in 2003 when it unveiled a popular analytical appliance, and was soon followed by dozens of startups. Recognizing the opportunity, all the big names in software and hardware--Oracle, IBM, Hewlett-Packard, and SAP--subsequently jumped into the market, either by building or buying technology, to provide purpose-built analytical systems to new and existing customers.
Although the price tag of these systems often exceeds a million dollars, customers find that the exceptional price-performance delivers significant business value, in both tangible and intangible form. For example, XO Communications recovered $3 million in lost revenue from a new revenue assurance application it built on an analytical appliance, even before it had paid for the system! It subsequently built or migrated a dozen applications to run on the new purpose-built system, testifying to its value.
Kelley Blue Book, the provider of online automobile valuations, purchased an analytical appliance to run its data warehouse, which had been experiencing performance issues, and the new system has given the company a competitive edge. For instance, it reduced the time needed to process hundreds of millions of automobile valuations from one week to one day. Kelley Blue Book now uses the system to analyze its Web advertising business and deliver dynamic pricing for its Web ads.
Challenges. Given the upfront costs of analytical platforms, organizations usually undertake a thorough evaluation of these systems before jumping on board.
First, companies must assess whether an analytical platform outperforms their existing data warehouse database to a degree that warrants migration and retraining costs. This requires a proof of concept (POC) in which customers test the systems in their own data center using their own data across a range of queries. The good news is that the new analytical platforms usually deliver jaw-dropping performance for most queries tested. In fact, many customers don't believe the initial results and rerun the queries to make sure that the results are valid.
Second, companies must choose from more than two dozen analytical platforms on the market today. For instance, they must decide whether to purchase an appliance or a software-only system, a columnar database or an MPP database, or an on-premises system or a Web service. Evaluating these options takes time, and many companies create a short-list that doesn't always contain comparable products.
Finally, companies must decide what role an analytical platform will play in their data warehousing architectures. Should it serve as the data warehousing platform? If so, does it handle multiple workloads easily or is it a one-trick pony? If the latter, what applications and data sets make sense to offload to the new system? How do you rationalize having two data warehousing environments instead of one?
Today, we find that companies that have tapped out their SQL Server or MySQL data warehouses often replace them with analytical platforms to get better performance. However, companies that have implemented an enterprise data warehouse on Oracle, Teradata, or IBM often find that the best use of an analytical platform is to sit alongside the data warehouse and offload existing analytical workloads or handle new applications. This architecture helps organizations avoid a costly upgrade to their data warehousing platform, which might easily exceed the cost of purchasing an analytical platform.
The Big Data movement consists of two separate, but interrelated, markets: one for Hadoop and open source data management software and the other for purpose-built SQL databases optimized for query processing. Hadoop avoids most of the upfront licensing and loading costs endemic to traditional relational database systems. However, since the technology is still immature, there are hidden costs that have thus far kept many Hadoop implementations experimental in nature. On the other hand, analytical platforms are a more proven technology, but they impose significant upfront licensing fees and potential migration costs. Companies wading into the Big Data stream need to evaluate their options carefully.
Posted February 6, 2012 8:30 AM