As companies grapple with the gargantuan task of processing and analyzing "big data," certain technologies have captured the industry limelight, namely massively parallel processing (MPP) databases, such as those from Aster Data and Greenplum; data warehousing appliances, such as those from Teradata, Netezza, and Oracle; and, most recently, Hadoop, an open source distributed file system that uses the MapReduce programming model to process key-value data in parallel across large numbers of commodity servers.
SMP Machines. Missing in action from this list is the venerable symmetric multiprocessing (SMP) machine that parallelizes operations across multiple CPUs (or cores) . The industry today seems to favor "scale out" parallel processing approaches (where processes run across commodity servers) rather than "scale up" approaches (where processes run on a single server.) However, with the advent of multi-core servers that today can pack upwards of 48 cores in a single CPU, the traditional SMP approach is worth a second look for processing big data analytics jobs.
The benefits of applying parallel processing within a single server versus multiple servers are obvious: reduced processing complexity and a smaller server footprint. Why buy 40 servers when one will do? MPP systems require more boxes, which require more space, cooling, and electricity. Also, distributing data across multiple nodes chews up valuable processing time and overcoming node failures, which are more common when you string together dozens, hundreds, or even thousands of servers into a single, coordinated system, adds to overhead, reducing performance.
Multi-Core CPUs. Moreover, since chipmakers maxed out the processing frequency of individual CPUs in 2004, the only way they can deliver improved performance is by packing more cores into a single chip. Chipmakers started with two-core chips, then quad-cores, and now eight- and 16-core chips are becoming commonplace.
Unfortunately, few software programs that can benefit from parallelizing operations have been redesigned to exploit the tremendous amount of power and memory available within multi-core servers. Big data analytics applications are especially good candidates for thread-level parallel processing. As developers recognize the untold power lurking within their commodity servers, I suspect next year that SMP processing will gain an equivalent share of attention among big data analytic proselytizers.
One company that is on the forefront of exploiting multi-core chips for analytics is Pervasive Software, a $50 million software company that is best known for its Pervasive Integration ETL software (which it acquired from Data Junction) and Pervasive PSQL, its embedded database (a.k.a. Btrieve.)
In 2009, Pervasive released a new product, called Pervasive DataRush, a parallel dataflow platform designed to accelerate performance for data preparation and analytics tasks. It fully leverages the parallel processing capabilities of multi-core processors and SMP machines, making it unnecessary to implement clusters (or MPP grids) to achieve suitable performance when processing and analyzing moderate to heavy volumes of data.
Sweet Spot. As a parallel data flow engine, Pervasive DataRush is often used today to power batch processing jobs, and is particularly well suited to running data preparation tasks (e.g. sorting, deduplicating, aggregating, cleansing, joining, loading, validating) and machine learning programs, such as fuzzy matching algorithms.
Today, DataRush will outperform Hadoop on complex processing jobs that address data volumes ranging from 500GB to tens of terabytes. Today, it is not geared to handling hundreds of terabytes to petabytes of data, which is the territory for MPP systems and Hadoop. However, as chipmakers continue to add more cores to chips and when Pervasive releases DataRush 5.0 later this year which supports small clusters, DataRush's high-end scalability will continue to increase.
Architecture. DataRush is not a database; it's a development environment and execution engine that runs in a Java Virtual Machine. Its Eclipse-based development environment provides a library of parallel operators for developers to create parallel dataflow programs. Although developers need to understand the basics of parallel operations--such as when it makes sense to partition data and/or processes based on the nature of their application-- DataRush handles all the underlying details of managing threads and processes across one or more cores to maximize utilization and performance. As you add cores, DataRush automatically readjusts the underlying parallelism without forcing the developer to recompile the application.
Versus Hadoop. To run DataRush, you feed the execution engine formatted flat files or database records and it executes the various steps in the dataflow and spits out a data set. As such, it's more flexible than Hadoop, which requires data to be structured as key-value pairs and partitioned across servers, and MapReduce, which forces developers to use one type of programming model for executing programs. DataRush also doesn't have the overhead of Hadoop, which requires each data element to be duplicated in multiple nodes for failover purposes and requires lots of processing to support data movement and exchange across nodes. But like Hadoop, it's focused on running predefined programs in batch jobs, not ad hoc queries.
Competitors. Perhaps the closest competitors to Pervasive DataRush are Ab Initio, a parallelizable ETL tool, and Syncsort, a high-speed sorting engine. But these tools were developed before the advent of multi-core processing and don't exploit it to the same degree as DataRush. Plus, DataRush is not focused just on back-end processing, but can handle front-end analytic processing as well. Its data flow development environment and engine are generic. DataRush actually makes a good complement to MPP databases, which often suffer from a data loading bottleneck. When used as a transformation and loading engine, DataRush can achieve 2TB/ hour throughput, according to company officials.
Despite all the current hype about MPP and scale-out architectures, it could be that scale-out architectures that fully exploit multi-core chips and SMP machines will win the race for mainstream analytics computing. Although you can't apply DataRush to existing analytic applications (you have to rewrite them), it will make a lot of sense to employ it for most new big data analytics applications.
Posted January 4, 2011 11:17 AM
Permalink | No Comments |