(Editor's note: This is the fourth in a multi-part series on Big Data.)
Faced with an expanding analytical ecosystem, BI managers need to make many technology choices. Perhaps the most difficult involves selecting a data processing system to power a variety of analytical applications. (See "The New Analytical Ecosystem: Making Way for Big Data.")
In the past, such decisions revolved around selecting one of a handful of leading relational database management (RDBM) systems to power a data warehouse or data mart. Often, the choice boiled down to internal politics as much as technical functionality.
Today, the options aren't as straightforward, although politics may still play a role. Instead of selecting a single data management product, BI managers may need to select multiple platforms to outfit an expanding analytical ecosystem. And rather than evaluating four or five alternatives for each platform, the BI manager is faced with dozens of viable options in each category. The once lazy database market is now a beehive of activity!
Staying abreast of all the new products, partnerships, and technological advances is now a full-time job. Industry analysts who make a living sifting through products in emerging markets are needed now more than ever. Most analysts (including me) will tell you that the first step in selecting an analytical platform is to understand the broad categories of products in the marketplace, and then make finer distinctions from there. (See figure 1.)
At a high level, there are four categories of analytical processing systems available today: transactional RDBM systems, analytic platforms, Hadoop distributions, and NoSQL databases. The following describes those categories and can be used as a starting point when creating a short list of products during a product evaluation process.
1. Transactional RDBM Systems
Transactional RDBM systems were originally designed to support transaction processing applications, although most have been retrofitted with various types of indexes, join paths, and custom SQL bolt-ons to make them more palatable to analytical processing. There are two types of transactional RDBM systems: enterprise and departmental.
- Enterprise Hubs. The traditional enterprise RDBM systems, such as those from IBM, Oracle, and Sybase, are best suited as data warehousing hubs that feed a variety of downstream, end-user facing systems, but don't handle query traffic directly. Although retrofitted with analytical capabilities, these systems often hit performance and scalability walls when used for query processing along with other workloads and are expensive to upgrade and replace. Thus, many customers now use these "gray-bearded" data warehousing systems as hubs to feed operational data stores, data marts, enterprise reporting systems, analytical sandboxes, and various analytical and transactional applications.
- Departmental Marts. A number of companies use Microsoft SQL Server or MySQL as data marts fed by an enterprise data warehouse or as stand-alone data warehouses for a business unit or small- or medium-size business (SMB). Like their enterprise brethren, these systems also often hit the wall when usage, data volumes, or query complexity increases rapidly. A fast-growing business unit or SMB often replaces these transactional RDBM systems with analytic appliances (see below), which provide the same or greater level of simplicity and ease of management as SQL Server or MySQL.
2. Analytic Platforms
Analytic platforms represent the first wave of Big Data systems. (See "Two Markets for Big Data: Comparing Value Propositions.") These are purpose-built, SQL-based systems designed to provide superior price-performance for analytical workloads compared to transactional RDBM systems. There are many types of analytic platforms. Most are being used as data warehousing replacements or stand-alone analytical systems.
- MPP Database. Massively parallel processing (MPP) databases with strong mixed workload utilities make good enterprise data warehouses for analytically minded organizations. Teradata was the first on the block with such a system, but it now has many competitors, including EMC Greenplum and Microsoft's Parallel Data Warehousing Option, which are relative upstarts compared to the 30-year-old Teradata.
- Analytical Appliance. These purpose-built analytical systems come as an integrated hardware-software combination tuned for analytical workloads. Analytical appliances come in many shapes, sizes, and configurations. Some, like IBM Netezza, EMC Greenplum, and Oracle Exadata, are more general purpose analytical machines that can serve as replacements for most data warehouses. Others, such as those from Teradata, are geared to specific analytical workloads, such as delivering extremely fast performance or managing super large data volumes.
- In-Memory Systems. If you are looking for raw performance, there is nothing better than a system that lets you put all your data into memory. These systems will soon become more commonplace, thanks to SAP, which is betting its business on HANA, an in-memory database for transactional and analytical processing, and is evangelizing the need for in-memory systems. Another contender in this space is Kognitio. Many RDBM systems are beginning to better exploit memory for caching results and processing queries.
- Columnar. Columnar databases, such as SAP's Sybase IQ, Hewlett Packard's Vertica, Paraccel, Infobright, Exasol, Calpont, and Sand, offer fast performance for many types of queries because of the way these systems store and compress data by columns instead of rows. Column storage and processing is fast becoming an RDBM system feature rather than a distinct subcategory of products.
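To make the columnar idea concrete, here is a minimal sketch in plain Python of pivoting rows into columns and compressing a repetitive column with run-length encoding. The sample rows, the column names, and the helper functions are all hypothetical, and run-length encoding is only one of several compression schemes a columnar product might use; this does not describe how any vendor named above implements storage.

```python
from itertools import groupby

# Hypothetical sample rows: (order_id, region, amount)
rows = [
    (1, "East", 100),
    (2, "East", 250),
    (3, "East", 75),
    (4, "West", 300),
    (5, "West", 125),
]


def to_columns(rows, names):
    """Pivot row tuples into one list of values per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def run_length_encode(values):
    """Collapse runs of repeated values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows, ["order_id", "region", "amount"])

# An aggregate over one column touches only that column's values,
# not the full width of every row.
total_amount = sum(columns["amount"])

# Values within a single column tend to repeat, so they compress well.
encoded_region = run_length_encode(columns["region"])
```

In this toy layout, summing `amount` never reads `order_id` or `region`, which is the intuition behind the query speedups, and the five `region` values collapse into two runs.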
3. Hadoop Distributions
Hadoop is an open source software project run within the Apache Software Foundation for processing data-intensive applications in a distributed environment with built-in parallelism and failover. The most important parts of Hadoop are the Hadoop Distributed File System (HDFS), which stores data in files on a cluster of servers, and MapReduce, a programming framework for building parallel applications that run on HDFS. The open source community is building numerous additional components to turn Hadoop into an enterprise-caliber data processing environment. The collection of these components is called a Hadoop distribution. Leading providers of Hadoop distributions include Cloudera, IBM, EMC, Amazon, Hortonworks, and MapR.
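For readers who have not seen the MapReduce programming model, the following is a rough single-process sketch of its three stages (map, shuffle, reduce) applied to a word count. It is plain Python and does not use the Hadoop API; the function names and sample input are hypothetical, and a real job would run the map and reduce steps in parallel across the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, the step a framework performs between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input standing in for files stored on a cluster.
sample_lines = ["big data big choices", "data platforms"]
counts = word_count(sample_lines)
```

Because each call to `map_phase` depends only on its own line and each call to `reduce_phase` only on its own key, both steps can be spread across many servers, which is where the built-in parallelism comes from.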
Today, in most customer installations, Hadoop serves as a staging area and online archive for unstructured and semi-structured data, as well as an analytical sandbox for data scientists who query Hadoop files directly before the data is aggregated or loaded into the data warehouse. But this could change. Hadoop will play an increasingly important role in the analytical ecosystem at most companies, either working in concert with an enterprise DW or assuming most of its duties.
4. NoSQL Databases
NoSQL is the name given to a broad set of databases whose only common thread is that they don't require SQL to process data, although some support both SQL and non-SQL forms of data processing. There are many types of NoSQL databases, and the list grows longer every month. These specialized systems are built using either proprietary or open source components, or a mix of both. In most cases, they are designed to overcome the limitations of traditional RDBM systems in handling unstructured and semi-structured data. Here's a partial listing of NoSQL systems:
- Key Value Pair Databases. These systems store data as a simple record structure consisting of a key and content. These are used for operational applications that involve large volumes of data, flexible data structures, and fast transactions. Leading key value pair databases include Cassandra, HBase, and Basho Riak.
- Document Stores. These systems specialize in storing, parsing, and processing application objects, typically using a lightweight structure such as JSON. Like key value databases, document stores are used for high-volume transaction processing. Leaders here include MongoDB and Couchbase.
- SQL MapReduce. These systems allow users to use SQL to invoke MapReduce jobs running inside the database or associated file system. Teradata's Aster Data and EMC Greenplum support these capabilities.
- Graph Systems. These databases store associations among entities, making them popular among social media companies that need to track connections among people.
- Unified Information Access. These systems, such as those from Attivio, MarkLogic, and Splunk, use more of a search storage and query paradigm to query both structured and unstructured data.
- Other. There are many other NoSQL databases that vary by how they store and process data or the types of applications they are designed to support.
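The differences among the key value, document, and graph models listed above are easier to see side by side. The sketch below mimics each one with ordinary Python data structures; the keys, documents, names, and helper functions are all hypothetical, and none of this reflects the API of any product mentioned in the list.

```python
import json

# Key value: opaque content retrieved only by its key.
kv_store = {}
kv_store["session:42"] = b"cart=3;last_page=/checkout"

# Document: JSON objects whose internal fields can be inspected and queried.
documents = [
    json.loads('{"_id": 1, "name": "Ana", "tags": ["bi", "hadoop"]}'),
    json.loads('{"_id": 2, "name": "Ben", "tags": ["nosql"]}'),
]


def find_by_tag(docs, tag):
    """Return the names of documents whose 'tags' field contains the tag."""
    return [doc["name"] for doc in docs if tag in doc.get("tags", [])]


# Graph: entities plus the associations among them (who follows whom).
follows = {
    "ana": {"ben"},
    "ben": {"cy", "dee"},
    "cy": set(),
    "dee": set(),
}


def friends_of_friends(graph, person):
    """Return people two hops away, excluding direct connections and the person."""
    direct = graph.get(person, set())
    second = set()
    for friend in direct:
        second |= graph.get(friend, set())
    return second - direct - {person}


hadoop_people = find_by_tag(documents, "hadoop")
suggestions = friends_of_friends(follows, "ana")
```

The key value store can only hand back the stored bytes, the document store can filter on a field inside each object, and the graph structure answers a connection question directly, which is the kind of trade-off that separates these subcategories.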
The above four categories represent just the start of a broader categorization of data processing systems geared to analytic workloads. This is a fast-moving field that is changing all the time. With the multiplicity of choices available today, BI professionals need to understand the differences between data management offerings so they can position them properly within the new analytical ecosystem.
Posted February 22, 2012 2:00 PM