Database Technology for the Web: Part 1 – The MapReduce Debate
by Colin White
Originally published June 24, 2009
Over the course of the nearly forty years I have been working on database systems, there have been many debates and arguments about which database technology to use for any given application. These arguments have become heated, especially when a new database technology appears that claims to be superior to anything that came before.
What is MapReduce?MapReduce has been popularized by Google that uses it to process many petabytes of data every day. A landmark paper by Jeffrey Dean and Sanjay Ghemawat of Google states that:
“MapReduce is a programming model and an associated implementation for processing and generating large data sets…. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.”
Michael Stonebraker’s comments on MapReduce explain MapReduce in more detail:
“The basic idea of MapReduce is straightforward. It consists of two programs that the user writes called map and reduce plus a framework for executing a possibly large number of instances of each program on a compute cluster.
The map program reads a set of records from an input file, does any desired filtering and/or transformations, and then outputs a set of records of the form (key, data). As the map program produces output records, a "split" function partitions the records into M disjoint buckets by applying a function to the key of each output record. This split function is typically a hash function, though any deterministic function will suffice. When a bucket fills, it is written to disk. The map program terminates with M output files, one for each bucket.
After being collected by the map-reduce framework, the input records to a reduce instance are grouped on their keys (by sorting or hashing) and fed to the reduce program. Like the map program, the reduce program is an arbitrary computation in a general-purpose language. Hence, it can do anything it wants with its records. For example, it might compute some additional function over other data fields in the record. Each reduce instance can write records to an output file, which forms part of the answer to a MapReduce computation.”
The key/value pairs produced by the map program can contain any type of arbitrary data in the value field. Google, for example, uses this approach to index large volumes of unstructured data. Although Google uses its own version of MapReduce, there is also an open source version called Hadoop from the Apache project. IBM and Google have announced a major initiative to use Hadoop to support university courses in distributed computer programming.
MapReduce is not a new concept. It is based on the list processing capabilities in declarative functional programming languages such as LISP (LISt Processing). Today’s systems implement MapReduce in imperative languages such as Java, C++, Python, Perl, Ruby, etc.
The key/value pairs used in MapReduce processing may be stored in a file or a database system. Google uses its BigTable database system (which is built on top of the Google distributed file system, GFS) to manage the data.
Key/value pair databases have existed for many years. For example, Berkeley DB is an embedded database system that stores data in a key/value pair data structure. It was originally developed in the1980s at Berkeley, but it is now owned by Oracle. Berkeley DB can also act as a back end storage engine for the MySQL open source relational DBMS.
Why the Controversy?Given that MapReduce is not a database model, but a programming model for building powerful distributed and parallel processing applications, why is there such a controversy with respect to relational systems? To answer this question we need to examine the relational model of data in more detail.
In a relational model, data is conceptually stored in a set of relations or tables. These tables are manipulated using relational operators such as selection, projection and join. Today, these relational operators are implemented primarily using the structured query language (SQL).
How the table data is physically stored and managed in a relational database management system (RDBMS) is up to the vendor. The mapping of relational operators (SQL statements) to the back-end storage engine is handled by the relational optimizer whose job it is to find the optimal way of physically accessing the data. This physical data independence is a key benefit of the relational model.
When using SQL, users define what data they want, not how it is to be accessed. Techniques such as indexing and parallel and distributed computing are handled by the underlying RDBMS. SQL is a declarative language, and not an imperative/procedural language like Java and C++, which require a detailed description of any data access algorithms that need to be run. Of course, SQL statements can be embedded in procedural languages. The reverse is also true; SQL can invoke stored procedures and user-defined functions written in a procedural language.
The concern of Michael Stonebraker is that the use and teaching of MapReduce will take the industry back to the pre-relational times when there was a lack of formalized database schemas and application data independence. MapReduce advocates argue that much of the data processed by MapReduce involves unstructured data that lacks a data schema. They also argue that today’s programmers vastly outnumber SQL experts, don’t know or don’t want to know SQL, find MapReduce much simpler, and prefer to access and analyze data using their own procedural programming.
Both camps are correct and both approaches have their benefits and uses. As I said at the beginning of this article, one size does not fit all. The challenge is to understand where each approach fits.
Data Analysis Processing ModesWhen accessing and analyzing data there are three types of processing that need to be considered: batch processing of static data, interactive processing of static data, and dynamic processing of in-flight data. A business intelligence environment, for example, involves the SQL processing of static data in a data warehouse. This can be done in batch mode (production reporting) or interactively (on-demand analytical processing). SQL may also be used to analyze and transform data as it is captured from operational systems and loaded into a data warehouse.
MapReduce is used to process large amounts of data in batch mode. It is particularly useful for processing unstructured data or sparse data involving many dimensions. It is not suited to interactive processing. It would be very useful, for example, for transforming large amounts of unstructured data for loading into a data warehouse, or for data mining.
Neither MapReduce nor SQL are particularly suitable to the dynamic processing of in-flight data such as event data. This is why we are seeing extensions to SQL (such as StreamSQL) and new technologies such as stream and complex event processing to handle this need. MapReduce is, however, useful for the filtering and transforming of large event files such as web logs. The next article in this series will look at stream processing in more detail.
MapReduce and Relational Coexistence and IntegrationSeveral analytical RDBMS vendors (Vertica, Greenplum, Aster Data Systems) are offering solutions that combine MapReduce (MR) and relational technology.
Vertica’s strategy is one of coexistence. With Vertica, MR programs continue to run in their normal operating environment, but instead of routing the output to the MR system, the Reduce program loads output data into the Vertica relational DBMS. The Vertica support works in conjunction with Amazon Elastic MapReduce (EMR). EMR is a web service that provides a hosted Hadoop framework running on the infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). This link shows how to use EMR to process and load a data set from S3 into the Vertica RDBMS running on Amazon EC2.
The Vertica solution could be used, for example, to do batch ETL (extract, transform, load) processing where the input is a very large data set (a set of web logs, for example) and the output is loaded into a data warehouse managed by Vertica.
Aster and Greenplum have a strategy of integrating the MR processing framework into the RDBMS to take advantage of the benefits of RDBMS technology such as parallel computing, scalability, backup and recovery, and so forth.
Greenplum allows developers to write MR programs in the Python and Perl scripting languages. This support enables MR scripts to use open source features such as text analysis and statistical toolkits. These scripts can access flat files and web pages, and can use SQL to access Greenplum relational tables. Source tables can be read by Map scripts and target tables can be created by Reduce scripts. This architecture allows developers to mix and match data sources and programming styles. It also allows the building of a data warehouse using both ETL (data is transformed before it is loaded in the data warehousing environment) and ELT (data is transformed after it is loaded into the data warehousing environment) approaches.
Greenplum MR scripts can also be used as virtual tables by SQL statements – the MR job is run on the fly as part of the SQL query processing. Greenplum’s RDBMS engine executes all the code – SQL, Map scripts, Reduce scripts – on the same cluster of machines where the Greenplum database is stored.
For more information, see the Greenplum white paper.
Whereas Greenplum tends to emphasize the use of SQL in MR programs, Aster takes the opposite approach of focusing on the use of MR processing capabilities in SQL-based programs. Aster allows MR user-defined functions to be invoked using SQL. These functions can be written in languages such as Python, Perl, Java, C++ and Microsoft .NET (C#, F#, Visual Basic), and can use SQL data manipulation and data definition statements. The Linux .NET support is provided by the Mono open source product. These functions can also read and write data from flat files. Like Greenplum, Aster MR capabilities can be used for loading a data warehouse using both ETL and ELT approaches. Aster, however, tends to emphasize the power of the ELT approach.
For more information, see the Aster white paper.
Both Greenplum and Aster allow the combining of relational data with MapReduce style data. This is particularly useful for batch data transformation and integration applications, and intensive data mining operations. The approach used will depend on the application and the type of developer. In general, programmers may prefer the Greenplum approach, whereas SQL experts may prefer the Aster approach.
What About Performance?MapReduce supporters often state that MapReduce provides superior performance to relational. This obviously depends on the workload. Andrew Pavlo of Brown University together with Michael Stonebraker, David DeWitt and several others recently published a paper comparing the performance of two relational DBMSs (Vertica and an undisclosed row-oriented DBMS) with Hadoop MapReduce. The paper concluded that, “In general, the SQL DBMSs were significantly faster and required less code to implement each task, but took longer to tune and load the data.” It also acknowledged that, “In our opinion there is a lot to learn from both kinds of systems” and “…the APIs of the two classes of systems are clearly moving toward each other.”
ConclusionMapReduce has achieved significant visibility because of its use by Google and its ability to process large amounts of unstructured web data, and also because of the heated debate between the advocates of MapReduce and relational database technology experts.
Two things are clear. Programmers like the simplicity of MapReduce and there is a clear industry direction toward supporting MR capabilities in traditional DBMS systems.
MapReduce is particularly attractive for the batch processing of large files of unstructured data for use in a business intelligence system. My personal opinion is that if MR programs are being used to filter and transform unstructured data (documents, web pages, web logs, event files) for loading into a data warehouse, then I prefer an ETL approach to an ELT approach. This is because the ELT approach usually involves storing unstructured data in relational tables and manipulating it using SQL. I have seen many examples of these types of database applications, and this approach is guaranteed to give database designers heartburn.
At the same time, I accept that some organizations would prefer a single data management framework based on an RDBMS. This is one of the reasons why DBMS vendors added support for XML data and XQuery to their RDBMS products. My concern is that relational products and SQL are becoming overly complex, especially for application developers.
Recent articles by Colin White
Copyright 2004 — 2020. Powell Media, LLC. All rights reserved.
BeyeNETWORK™ is a trademark of Powell Media, LLC