Blog: Wayne Eckerson

Wayne Eckerson

Welcome to Wayne's World, my blog that illuminates the latest thinking about how to deliver insights from business data and celebrates out-of-the-box thinkers and doers in the business intelligence (BI), performance management and data warehousing (DW) fields. Tune in here if you want to keep abreast of the latest trends, techniques, and technologies in this dynamic industry.

About the author

Wayne has been a thought leader in the business intelligence field since the early 1990s. He has conducted numerous research studies and is a noted speaker, blogger, and consultant. He is the author of two widely read books: Performance Dashboards: Measuring, Monitoring, and Managing Your Business (2005, 2010) and The Secrets of Analytical Leaders: Insights from Information Insiders (2012).

Wayne is currently director of BI Leadership Research, an education and research service run by TechTarget that provides objective, vendor neutral content to business intelligence (BI) professionals worldwide. Wayne’s consulting company, BI Leader Consulting, provides strategic planning, architectural reviews, internal workshops, and long-term mentoring to both user and vendor organizations. For many years, Wayne served as director of education and research at The Data Warehousing Institute (TDWI) where he oversaw the company’s content and training programs and chaired its BI Executive Summit. He can be reached by email at weckerson@techtarget.com.

February 2012 Archives

(Editor's note: This is the fourth in a multi-part series on Big Data.)

Faced with an expanding analytical ecosystem, BI managers need to make many technology choices. Perhaps the most difficult involves selecting a data processing system to power a variety of analytical applications. (See "The New Analytical Ecosystem: Making Way for Big Data.")

In the past, such decisions revolved around selecting one of a handful of leading relational database management (RDBM) systems to power a data warehouse or data mart. Often, the choice boiled down to internal politics as much as technical functionality.

Today, the options aren't as straightforward, although politics may still play a role. Instead of selecting a single data management product, BI managers may need to select multiple platforms to outfit an expanding analytical ecosystem. And rather than evaluating four or five alternatives for each platform, the BI manager is faced with dozens of viable options in each category. The once lazy database market is now a beehive of activity!

Staying abreast of all the new products, partnerships, and technological advances is now a full-time job. Industry analysts who make a living sifting through products in emerging markets are needed now more than ever. Most analysts (including me) will tell you that the first step in selecting an analytical platform is to understand the broad categories of products in the marketplace, and then make finer distinctions from there. (See figure 1.)

Figure 1. Big Data DBMS Positioning

At a high level, there are four categories of analytical processing systems available today: transactional RDBM systems, analytic platforms, Hadoop distributions, and NoSQL databases. The following describes those categories and can be used as a starting point when creating a short list of products during a product evaluation.

1. Transactional RDBM Systems

Transactional RDBM systems were originally designed to support transaction processing applications, although most have been retrofitted with various types of indexes, join paths, and custom SQL bolt-ons to make them better suited to analytical processing. There are two types of transactional RDBM systems: enterprise and departmental.

  • Enterprise Hubs. The traditional enterprise RDBM systems, such as those from IBM, Oracle, and Sybase, are best suited as data warehousing hubs that feed a variety of downstream, end-user facing systems, but don't handle query traffic directly. Although retrofitted with analytical capabilities, these systems often hit performance and scalability walls when used for query processing along with other workloads and are expensive to upgrade and replace. Thus, many customers now use these "gray-bearded" data warehousing systems as hubs to feed operational data stores, data marts, enterprise reporting systems, analytical sandboxes, and various analytical and transactional applications.
  • Departmental Marts. A number of companies use Microsoft SQL Server or MySQL as data marts fed by an enterprise data warehouse or as stand-alone data warehouses for a business unit or small- or medium-size business (SMB). Like their enterprise brethren, these systems also often hit the wall when usage, data volumes, or query complexity increase rapidly. A fast-growing business unit or SMB often replaces these transactional RDBM systems with analytic appliances (see below), which provide the same or greater level of simplicity and ease of management as SQL Server or MySQL.

2. Analytic Platforms

Analytic platforms represent the first wave of Big Data systems. (See "Two Markets for Big Data: Comparing Value Propositions.") These are purpose-built, SQL-based systems designed to provide superior price-performance for analytical workloads compared to transactional RDBM systems. There are many types of analytic platforms. Most are being used as data warehouse replacements or stand-alone analytical systems.

  • MPP Database. Massively parallel processing (MPP) databases with strong mixed-workload utilities make good enterprise data warehouses for analytically minded organizations. Teradata was the first on the block with such a system, but it now has many competitors, including EMC Greenplum and Microsoft's Parallel Data Warehousing Option, which are relative upstarts compared to the 30-year-old Teradata.
  • Analytical Appliance. These purpose-built analytical systems come as an integrated hardware-software combination tuned for analytical workloads. Analytical appliances come in many shapes, sizes, and configurations. Some, like IBM Netezza, EMC Greenplum, and Oracle Exadata, are more general-purpose analytical machines that can serve as replacements for most data warehouses. Others, such as those from Teradata, are geared to specific analytical workloads, like delivering extremely fast performance or managing very large data volumes.
  • In-Memory Systems. If you are looking for raw performance, there is nothing better than a system that lets you put all your data into memory. These systems will soon become more commonplace, thanks to SAP, which is betting its business on HANA, an in-memory database for transactional and analytical processing, and is evangelizing the need for in-memory systems. Another contender in this space is Kognitio. Many RDBM systems are beginning to better exploit memory for caching results and processing queries.
  • Columnar. Columnar databases, such as SAP's Sybase IQ, Hewlett-Packard's Vertica, ParAccel, Infobright, Exasol, Calpont, and Sand, offer fast performance for many types of queries because these systems store and compress data by columns instead of rows. Column storage and processing is fast becoming an RDBM system feature rather than a distinct subcategory of products. (The sketch below illustrates why column-wise storage helps analytical queries.)
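
To make the columnar idea concrete, here is a minimal, illustrative sketch in Python with invented table and column names. It is not how any particular product is implemented; it simply shows why a query that touches one column out of many reads far less data in a column store, and why a low-cardinality column compresses well when stored contiguously.

```python
# Illustrative only: row-oriented vs. column-oriented storage for an analytic
# query that touches one column out of many. Table and values are invented.

rows = [  # row store: each record keeps all of its columns together
    {"order_id": 1, "region": "East", "amount": 120.0},
    {"order_id": 2, "region": "East", "amount": 75.5},
    {"order_id": 3, "region": "West", "amount": 210.0},
]

columns = {  # column store: each column is stored (and compressed) separately
    "order_id": [1, 2, 3],
    "region": ["East", "East", "West"],
    "amount": [120.0, 75.5, 210.0],
}

# SUM(amount): a row store must read every column of every row ...
total_from_rows = sum(r["amount"] for r in rows)

# ... while a column store scans only the single column the query needs.
total_from_columns = sum(columns["amount"])

def run_length_encode(values):
    """Run-length encoding: repeated values in a low-cardinality column
    (such as region) collapse into (value, count) pairs."""
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1][1] += 1
        else:
            encoded.append([v, 1])
    return [tuple(run) for run in encoded]

print(total_from_rows, total_from_columns)    # 405.5 405.5
print(run_length_encode(columns["region"]))   # [('East', 2), ('West', 1)]
```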

3. Hadoop Distributions

Hadoop is an open source software project run within the Apache Software Foundation for processing data-intensive applications in a distributed environment with built-in parallelism and failover. The most important parts of Hadoop are the Hadoop Distributed File System (HDFS), which stores data in files on a cluster of servers, and MapReduce, a programming framework for building parallel applications that run on HDFS. The open source community is building numerous additional components to turn Hadoop into an enterprise-caliber data processing environment. The collection of these components is called a Hadoop distribution. Leading providers of Hadoop distributions include Cloudera, IBM, EMC, Amazon, Hortonworks, and MapR.
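
For readers unfamiliar with the MapReduce model that Hadoop parallelizes across an HDFS cluster, here is a small, self-contained sketch in Python. Real Hadoop jobs are typically written in Java and run over files stored in HDFS; the log format and field layout below are hypothetical, and the shuffle step is simulated with an in-memory sort.

```python
# Minimal, self-contained sketch of the MapReduce programming model that
# Hadoop runs in parallel across a cluster. The log format is hypothetical.
from itertools import groupby
from operator import itemgetter

log_lines = [
    "2012-02-10 /products/widget 200",
    "2012-02-10 /products/widget 200",
    "2012-02-10 /checkout 500",
]

def mapper(line):
    """Map phase: emit a (key, value) pair for each input record."""
    date, url, status = line.split()
    yield (url, 1)

def reducer(key, values):
    """Reduce phase: aggregate all values that share a key."""
    return key, sum(values)

# Shuffle/sort: group intermediate pairs by key (Hadoop does this between phases).
intermediate = sorted(pair for line in log_lines for pair in mapper(line))
results = [reducer(key, (v for _, v in group))
           for key, group in groupby(intermediate, key=itemgetter(0))]

print(results)  # [('/checkout', 1), ('/products/widget', 2)]
```

Hadoop's contribution is running the same map and reduce logic in parallel across many servers, with built-in failover when individual tasks or nodes fail.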

Today, in most customer installations, Hadoop serves as a staging area and online archive for unstructured and semi-structured data, as well as an analytical sandbox for data scientists who query Hadoop files directly before the data is aggregated or loaded into the data warehouse. But this could change. Hadoop will play an increasingly important role in the analytical ecosystem at most companies, either working in concert with an enterprise DW or assuming most of its duties.

4. NoSQL Databases

NoSQL is the name given to a broad set of databases whose only common thread is that they don't require SQL to process data, although some support both SQL and non-SQL forms of data processing. There are many types of NoSQL databases, and the list grows longer every month. These specialized systems are built using either proprietary or open source components, or a mix of both. In most cases, they are designed to overcome the limitations of traditional RDBM systems in handling unstructured and semi-structured data. Here's a partial listing of NoSQL systems:

  • Key-Value Pair Databases. These systems store data as a simple record structure consisting of a key and content. They are used for operational applications that involve large volumes of data, flexible data structures, and fast transactions. Leading key-value pair databases include Cassandra, HBase, and Basho Riak.
  • Document Stores. These systems specialize in storing, parsing, and processing application objects, typically using a lightweight structure such as JSON. Like key-value databases, document stores are used for high-volume transaction processing. Leaders here include MongoDB and Couchbase. (A simple sketch of both record structures appears after this list.)
  • SQL MapReduce. These systems allow users to use SQL to invoke MapReduce jobs running inside the database or associated file system. Teradata's Aster Data and EMC Greenplum support these capabilities.
  • Graph Systems. These databases store associations among entities, making them popular among social media companies that need to track connections among people.
  • Unified Information Access. These systems, such as those from Attivio, MarkLogic, and Splunk, use more of a search-style storage and query paradigm to query both structured and unstructured data.
  • Other. There are many other NoSQL databases that vary by how they store and process data or the types of applications they are designed to support.
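
The record structures behind the first two categories above are simple enough to sketch in a few lines of Python. The example below is illustrative only: it does not use any vendor's API, and the keys, values, and documents are invented. It contrasts opaque values addressed by key (key-value stores) with schemaless, self-describing documents (document stores).

```python
# Illustrative sketch of the NoSQL record structures described above; this is
# not any particular vendor's API. The keys, values, and documents are invented.
import json

# Key-value pair: the store understands only keys and opaque values; the
# application decides what the content means.
kv_store = {}
kv_store["session:8f2e"] = b'{"user_id": 42, "cart": ["sku-123"]}'

# Document store: each record is a self-describing document whose fields can
# vary from one document to the next, with no table schema defined up front.
doc_store = {}
doc_store["user:42"] = {"name": "Ada", "visits": 17, "tags": ["mobile"]}
doc_store["user:43"] = {"name": "Bob", "referrer": "search"}  # different fields

# Reads are simple lookups by key; secondary indexes and query languages vary
# widely from product to product.
cart = json.loads(kv_store["session:8f2e"])["cart"]
print(cart, doc_store["user:43"].get("visits", 0))  # ['sku-123'] 0
```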

Summary

The above four categories represent just the start of a broader categorization of data processing systems geared to analytical workloads. This is a fast-moving field that is changing all the time. With the multiplicity of choices available today, BI professionals need to understand the differences among data management offerings so they can position them properly within the new analytical ecosystem.


Posted February 22, 2012 2:00 PM

(Editor's note: This is the fourth in a series on the Big Data Revolution.)


The Big Data revolution has arrived and it's transforming long-established data warehousing architectures into vibrant, multi-faceted analytical ecosystems.

Gone are the days when all analytical processing first passed through a data warehouse or data mart (or their less sanctified spreadmart or data shadow system brethren). Now data winds its way to users through a multiplicity of corporate data structures, each tailored to the type of content it contains and the type of user who wants to consume it.

Figure 1 depicts a reference architecture for the new analytical ecosystem that has the fingerprints of Big Data all over it. The objects in blue represent the traditional data warehousing environment, while those in pink represent new architectural elements made possible by Big Data technologies, namely Hadoop, NoSQL databases, high-performance analytical engines (e.g. analytical appliances, MPP databases, in-memory databases), and interactive, in-memory visualization tools.

Most source data now flows through Hadoop, which primarily acts as a staging area and online archive. This is especially true for semi-structured data, such as log files and machine-generated data, but also for some structured data that companies can't cost-effectively store and process in SQL engines (e.g. call detail records in a telecommunications company). From Hadoop, data is fed into a data warehousing hub, which often distributes data to downstream systems, such as data marts, operational data stores, and analytical sandboxes of various types, where users can query the data using familiar SQL-based reporting and analysis tools.

Today, data scientists analyze raw data inside Hadoop by writing MapReduce programs in Java and other languages. In the future, users will be able to query and process Hadoop data using familiar SQL-based data integration and query tools.

Figure 1. The New Analytical Ecosystem

Harmonizing Opposites

The Big Data revolution is not only about analyzing large volumes and new sources of data, it's also about balancing data alignment and consistency with flexible, ad hoc exploration. As such, the new analytical ecosystem features both top-down and bottom-up data flows that meet all business requirements for reporting and analysis.

The top-down world. In the top-down world, source data is processed, refined, and stamped with a predefined data structure--typically a dimensional model--and then consumed by casual users using SQL-based reporting and analysis tools. In this domain, IT developers create data and semantic models so business users can get answers to known questions and executives can track performance of predefined metrics. Here, design precedes access. The top-down world also takes great pains to align data along conformed dimensions and deliver clean, accurate data. The goal is to deliver a consistent view of the business entities so users can spend their time making decisions instead of arguing about the origins and validity of data artifacts.

The underworld. Creating a uniform view of the business from heterogeneous sets of data is not easy. It takes time, money, and patience, often more than most department heads and business analysts are willing to tolerate. They often abandon the top-down world for the underworld of spreadmarts and data shadow systems. Using whatever tools are readily available and cheap, these data-hungry users create their own views of the business. Eventually, they spend more time collecting and integrating data than analyzing it, undermining both their productivity and a consistent view of business information.

The bottom-up world. The new analytical ecosystem brings these prodigal data users back into the fold. It carves out space within the enterprise environment for true ad hoc exploration and promotes the rapid development of analytical applications using in-memory departmental tools. In a bottom-up environment, users can't anticipate the questions they will ask on a daily or weekly basis or the data they'll need to answer those questions. Often, the data they need doesn't yet exist in the data warehouse.

The new analytical ecosystem creates analytical sandboxes that let power users explore corporate and local data on their own terms. These sandboxes include Hadoop, virtual partitions inside a data warehouse, and specialized analytical databases that offload data or analytical processing from the data warehouse or handle new, untapped sources of data, such as Web logs or machine data. The new environment also gives department heads the ability to create and consume dashboards built with in-memory visualization tools that point to both a corporate data warehouse and other independent sources.

Combining top-down and bottom-up worlds is not easy. BI professionals need to assiduously guard data semantics while opening access to data. For their part, business users need to commit to adhering to corporate data standards in exchange for getting the keys to the kingdom. To succeed, organizations need robust data governance programs and lots of communication among all parties.

Summary. The Big Data revolution brings major enhancements to the BI landscape. First and foremost, it introduces new technologies, such as Hadoop, that make it possible for organizations to cost-effectively consume and analyze large volumes of semi-structured data. Second, it complements traditional top-down data delivery methods with more flexible, bottom-up approaches that promote ad hoc exploration and rapid application development.


Posted February 15, 2012 6:23 AM


(Editor's note: This is part III in a multi-part series on Big Data. To view part II, "Big Data Liberation Theology," click here.)

There are two types of Big Data in the market today. There is open source software, centered largely around Hadoop, which eliminates upfront licensing costs for managing and processing large volumes of data. And then there are new analytical engines, including appliances and column stores, which provide significantly higher price-performance than general purpose relational databases, which have dominated the market for three decades. Both sets of Big Data software deliver higher returns on investment than previous generations of data management technology, but in vastly different ways.

Hadoop

Free Software. Hadoop is an open source distributed storage and processing framework, available through the Apache Software Foundation, that is capable of storing and processing large volumes of data in parallel across a grid of commodity servers. Hadoop emanated from large internet companies, such as Google and Yahoo, which needed a cost-effective way to build search indexes. They knew that traditional relational databases would be prohibitively expensive and technically unwieldy, so they came up with a low-cost alternative that they built themselves and eventually gave to the Apache Software Foundation so others could benefit from their innovations.

Today, many companies are implementing Hadoop software from Apache as well as third-party providers, such as Cloudera, Hortonworks, EMC, and IBM. Developers see Hadoop as a cost-effective way to get their arms around large volumes of data that they've never been able to do much with before. For the most part, companies use Hadoop to store, process, and analyze large volumes of Web log data so they can get a better feel for the browsing and shopping behavior of their customers. Previously, most companies outsourced the analysis of their clickstream data or simply let it "fall on the floor" since they didn't have a timely and cost-effective way to process it.

Data Agnostic. Besides being free, the other major advantage of Hadoop software is that it's data agnostic. It can handle any type of data. Unlike a data warehouse or traditional relational database, Hadoop doesn't require administrators to model or transform data before they load it. With Hadoop, you don't define a structure for the data; you simply load and go. This significantly reduces the cost of preparing data for analysis compared to what happens in a data warehouse. Most experts assert that 60% to 80% of the cost of building a data warehouse, which can run into the tens of millions of dollars, involves extracting, transforming, and loading (ETL) data. Hadoop virtually eliminates this cost.

As a result, many companies are starting to use Hadoop as a general purpose staging area and archive for all their data. So, a telecommunications company can store 12 months of call detail records instead of aggregating that data in the data warehouse and rolling the details to offline storage. With Hadoop, they can keep all their data online and eliminate the cost of data archival systems. They can also let power users query Hadoop data directly if they want to access the raw data or can't wait for the aggregates to be loaded into the data warehouse.

Hidden Costs. Of course, nothing in technology is ever free. When it comes to processing data, you either "pay the piper" upfront, as in the data warehousing world, or at query time, as in the Hadoop world. Before querying Hadoop data, a developer needs to understand the structure of the data and all of its anomalies. With a clean, well-understood, homogeneous data set, this is not difficult. But most corporate data doesn't fit this description. So a Hadoop developer ends up playing the role of a data warehousing developer at query time, interrogating the data and making sure its format and content match expectations. Querying Hadoop today is a "buyer beware" environment.
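
As a rough illustration of what "paying the piper at query time" looks like, the sketch below uses Python and a hypothetical pipe-delimited call-detail-record layout to show the kind of defensive parsing a Hadoop developer writes when no ETL process has validated the data up front. Records that an ETL job would have rejected or repaired during loading must instead be caught by every query or job that reads the raw files.

```python
# Sketch of schema-on-read: the raw file was loaded into Hadoop "as is," so
# checking structure and content falls on the query-time code. The record
# layout (pipe-delimited call detail records) is hypothetical.
from datetime import datetime

raw_records = [
    "2012-02-01 13:05:22|6175551212|6085551234|187",
    "2012-02-01 13:06:10|6175551212||",   # missing callee and duration
    "CORRUPT LINE ####",                  # garbage that slipped into the feed
]

def parse_cdr(line):
    """Interrogate each record and reject anything that doesn't match
    expectations; this is work an ETL process would otherwise do up front."""
    parts = line.split("|")
    if len(parts) != 4:
        return None
    ts, caller, callee, duration = parts
    try:
        return {
            "ts": datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"),
            "caller": caller,
            "callee": callee or None,
            "seconds": int(duration) if duration else 0,
        }
    except ValueError:
        return None

clean = [r for r in (parse_cdr(line) for line in raw_records) if r]
print(len(clean), "of", len(raw_records), "records usable")  # 2 of 3 records usable
```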

Moreover, to run Big Data software, you still need to purchase, install, and manage commodity servers (unless you run your Big Data environment in the Cloud, say through Amazon Web Services). While each server may not cost a lot, collectively the price adds up.

But what's more costly is the expertise and software required to administer Hadoop and manage grids of commodity servers. Hadoop is still bleeding-edge technology, and few people have the skills or experience to run it efficiently in a production environment. These folks are hard to find, and they don't come cheap. Members of the Apache Software Foundation admit that Hadoop's latest release is equivalent to version 1.0 software, so even the experts have a lot to learn since the technology is evolving at a rapid pace. Nonetheless, Hadoop and its NoSQL brethren have opened up a vast new frontier for organizations to profit from their data.

Analytic Platforms

The other type of Big Data predates Hadoop and its NoSQL variants by several years. This version of Big Data is less a "movement" than an extension of existing relational database technology optimized for query processing. These analytical platforms span a range of technology, from appliances and columnar databases to shared-nothing, massively parallel processing (MPP) databases. The common thread among them is that most are read-only environments that deliver exceptional price-performance compared to general-purpose relational databases originally designed to run transaction processing applications.

Teradata laid the groundwork for the analytical platform market when it launched the first analytical appliance in the early 1980s. Sybase was also an early forerunner, shipping the first columnar database in the mid-1990s. Netezza kicked the current market into high gear in 2003 when it unveiled a popular analytical appliance, and it was soon followed by dozens of startups. Recognizing the opportunity, all the big names in software and hardware--Oracle, IBM, Hewlett-Packard, and SAP--subsequently jumped into the market, either by building or buying technology, to provide purpose-built analytical systems to new and existing customers.

Although the price tag of these systems often exceeds a million dollars, customers find that the exceptional price-performance delivers significant business value, in both tangible and intangible form. For example, XO Communications recovered $3 million in lost revenue from a new revenue assurance application it built on an analytical appliance, even before it had paid for the system! It subsequently built or migrated a dozen applications to run on the new purpose-built system, testifying to its value.

Kelley Blue Book purchased an analytical appliance to run its data warehouse, which was experiencing performance issues, giving the provider of online automobile valuations a competitive edge. For instance, the new system reduces the time needed to process hundreds of millions of automobile valuations from one week to one day. Kelley Blue Book now uses the system to analyze its Web advertising business and deliver dynamic pricing for its Web ads.

Challenges. Given the upfront costs of analytical platforms, organizations usually undertake a thorough evaluation of these systems before jumping on board.

First, companies must assess whether an analytical platform outperforms their existing data warehouse database to a degree that warrants migration and retraining costs. This requires a proof of concept (POC) in which customers test the systems in their own data center using their own data across a range of queries. The good news is that the new analytical platforms usually deliver jaw-dropping performance for most queries tested. In fact, many customers don't believe the initial results and rerun the queries to make sure that the results are valid.

Second, companies must choose from more than two dozen analytical platforms on the market today. For instance, they must decide whether to purchase an appliance or a software-only system, a columnar database or an MPP database, or an on-premise system or a Web service. Evaluating these options takes time, and many companies create a short list that doesn't always contain comparable products.

Finally, companies must decide what role an analytical platform will play in their data warehousing architectures. Should it serve as the data warehousing platform? If so, does it handle multiple workloads easily or is it a one-trick pony? If the latter, what applications and data sets make sense to offload to the new system? How do you rationalize having two data warehousing environments instead of one?

Today, we find that companies that have tapped out their SQL Server or MySQL data warehouses often replace them with analytical platforms to get better performance. However, companies that have implemented an enterprise data warehouse on Oracle, Teradata, or IBM often find that the best use of analytical platforms is to sit alongside the data warehouse and offload existing analytical workloads or handle new applications. This architecture helps organizations avoid a costly upgrade to their data warehousing platform, which might easily exceed the cost of purchasing an analytical platform.

Summary

The Big Data movement consists of two separate, but interrelated, markets: one for Hadoop and open source data management software and the other for purpose-built SQL databases optimized for query processing. Hadoop avoids most of the upfront licensing and loading costs endemic to traditional relational database systems. However, since the technology is still immature, there are hidden costs that have thus far kept many Hadoop implementations experimental in nature. On the other hand, analytical platforms are a more proven technology, but impose significant upfront licensing fees and potential migration costs. Companies wading into the waters of the Big Data stream need to evaluate their options carefully.


Posted February 6, 2012 8:30 AM