Blog: Wayne Eckerson

Welcome to Wayne's World, my blog that illuminates the latest thinking about how to deliver insights from business data and celebrates out-of-the-box thinkers and doers in the business intelligence (BI), performance management and data warehousing (DW) fields. Tune in here if you want to keep abreast of the latest trends, techniques, and technologies in this dynamic industry.

About the author

Wayne has been a thought leader in the business intelligence field since the early 1990s. He has conducted numerous research studies and is a noted speaker, blogger, and consultant. He is the author of two widely read books: Performance Dashboards: Measuring, Monitoring, and Managing Your Business (2005, 2010) and The Secrets of Analytical Leaders: Insights from Information Insiders (2012).

Wayne is currently director of BI Leadership Research, an education and research service run by TechTarget that provides objective, vendor neutral content to business intelligence (BI) professionals worldwide. Wayne’s consulting company, BI Leader Consulting, provides strategic planning, architectural reviews, internal workshops, and long-term mentoring to both user and vendor organizations. For many years, Wayne served as director of education and research at The Data Warehousing Institute (TDWI) where he oversaw the company’s content and training programs and chaired its BI Executive Summit. He can be reached by email at weckerson@techtarget.com.

Recently in Open Source Category


I don't think I've ever seen a market consolidate as fast as the analytic platform market.

By definition, an analytic platform is an integrated hardware and software data management system geared to query processing and analytics that offers dramatically higher price-performance than general purpose systems. After talking with numerous customers of these systems, I am convinced they represent game-changing technology. As such, major database vendors have been tripping over themselves to gain the upper hand in this multi-billion dollar market.

Rapid Fire Acquisitions. Microsoft made the first move when it purchased Datallegro in July 2008. But it's taken two years for Microsoft to port the technology to Windows and SQL Server, so, ironically, it finds itself trailing the leaders. Last May, SAP acquired Sybase, largely for its mobile technology, but also for its Sybase IQ analytic platform, which has long been the leading column-store database on the market and has done especially well in financial services. And SAP is sparking tremendous interest within its installed base for HANA, an in-memory appliance designed to accelerate query performance of SAP BW and other analytic applications.

Two months after SAP acquired Sybase, EMC snapped up Greenplum, a massively parallel processing (MPP) database vendor, and reportedly has done an excellent job executing new deals. Two months later, in September 2010, IBM purchased the leading pureplay, Netezza, in an all-cash deal worth $1.8 billion--a deal that could be a boon to Netezza if IBM can clearly differentiate among its multiple data warehousing offerings and execute well in the field.

And last month, Hewlett Packard, whose NeoView analytic platform died ingloriously last fall, scooped up Vertica, a market-leading columnar database with many interesting scalability and availability features. And finally, Teradata this week announced it was purchasing Aster Data, an MPP shared-nothing database with rich SQL MapReduce functions that can perform deep analytics on both structured and unstructured data.

So, in the past nine months, the world's biggest high tech companies purchased five of the leading, pureplay analytic platforms. This rapid pace of consolidation is dizzying!

Consolidation Drivers

Fear and Loathing. Part of this consolidation frenzy is driven by fear. Namely, fear of being left out of the market. And perhaps fear of Oracle, whose own analytic platform, Exadata, has gathered significant market momentum, knocking unsuspecting rivals back on their heels. Although pricey, Exadata not only fuels game-changing analytic performance, it now also supports transaction applications--a one-stop database engine that competitors may have difficulty derailing (unless Oracle shoots itself in the foot with uncompromising terms for licensing, maintenance, and proofs of concept).

Core Competencies. These analytic platform vendors are now carving out market niches where they can outshine the rest. For Oracle, it's a high-performance, hybrid analytic/transaction system; SAP touts its in-memory acceleration (HANA) and a mature columnar database that supports real-time analytics and complex event processing; EMC Greenplum targets complex analytics against petabytes of data; Aster Data focuses on analytic applications in which SQL MapReduce is an advantage; Teradata touts its mixed workload management capabilities and workload-specific analytic appliances; IBM Netezza focuses on simplicity, fast deployments, and quick ROI; Vertica trumpets its scalability, reliability, and availability now that other vendors have added columnar storage and processing capabilities; and Microsoft is pitching its PDW along with a series of data mart appliances and a BI appliance.

Pureplays Looking for Cover. The rush of acquisitions leaves a number of viable pureplays out in the cold. Without a big partner, these vendors will need to clearly articulate their positioning and work hard to gain beachheads within customer accounts. ParAccel, for example, is eyeing Fortune 100 companies with complex analytic requirements, targeting financial services where it says Sybase IQ is easy pickings. Dataupia is seeking cover in companies that have tens to hundreds of petabytes to query and store. Kognitio likes its chances with flexible cloud-based offerings that customers can bring in-house if desired. InfoBright is targeting the open source MySQL market, while Sand Technology touts its columnar compression, data mart synchronization, and text parsing capabilities. Ingres is pursuing the open source data warehousing market, and its new Vectorwise technology makes it a formidable in-memory analytics processing platform.

Despite the rapid consolidation of the analytic platforms market, there is still obviously lots of choice left for customers eager to cash in on the benefits of purpose-built analytical machines that deliver dramatically higher price-performance than database management systems of the past. Although the action was fast and furious in 2010, the race has only just begun. So, fasten your seat belts as players jockey for position in the sprint to the finish.


Posted March 8, 2011 8:20 AM

As companies grapple with the gargantuan task of processing and analyzing "big data," certain technologies have captured the industry limelight, namely massively parallel processing (MPP) databases, such as those from Aster Data and Greenplum; data warehousing appliances, such as those from Teradata, Netezza, and Oracle; and, most recently, Hadoop, an open source framework that couples a distributed file system with the MapReduce programming model to process key-value data in parallel across large numbers of commodity servers.

SMP Machines. Missing in action from this list is the venerable symmetric multiprocessing (SMP) machine, which parallelizes operations across multiple CPUs (or cores). The industry today seems to favor "scale out" parallel processing approaches (where processes run across many commodity servers) rather than "scale up" approaches (where processes run on a single server). However, with the advent of multi-core processors that now let a single server pack upwards of 48 cores, the traditional SMP approach is worth a second look for processing big data analytics jobs.

The benefits of applying parallel processing within a single server versus multiple servers are obvious: reduced processing complexity and a smaller server footprint. Why buy 40 servers when one will do? MPP systems require more boxes, which require more space, cooling, and electricity. Also, distributing data across multiple nodes chews up valuable processing time, and overcoming node failures--which are more common when you string together dozens, hundreds, or even thousands of servers into a single, coordinated system--adds overhead that reduces performance.

Multi-Core CPUs. Moreover, since chipmakers maxed out the processing frequency of individual CPUs in 2004, the only way they can deliver improved performance is by packing more cores into a single chip. Chipmakers started with two-core chips, then quad-cores, and now eight- and 16-core chips are becoming commonplace.

Unfortunately, few software programs that could benefit from parallelizing operations have been redesigned to exploit the tremendous amount of power and memory available within multi-core servers. Big data analytics applications are especially good candidates for thread-level parallel processing. As developers recognize the untold power lurking within their commodity servers, I suspect that next year SMP processing will gain an equivalent share of attention among big data analytics proselytizers.
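To make thread-level parallelism concrete, here's a minimal Java sketch (the input file name and the metric are invented for illustration) that uses the JDK's parallel streams to spread a simple aggregation across every core the JVM can see on a single SMP box:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SmpAggregate {
    public static void main(String[] args) throws IOException {
        // Report how many cores the JVM sees on this machine.
        System.out.println("Cores available: " + Runtime.getRuntime().availableProcessors());

        // Hypothetical input: one numeric measure per line (e.g., sale amounts).
        double total = Files.lines(Paths.get("measures.txt"))
                .parallel()                       // fan the work out across all cores
                .mapToDouble(Double::parseDouble) // parse each line on whichever thread picks it up
                .sum();                           // combine the partial sums from each thread

        System.out.println("Total = " + total);
    }
}
```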

Pervasive DataRush

One company that is on the forefront of exploiting multi-core chips for analytics is Pervasive Software, a $50 million software company that is best known for its Pervasive Integration ETL software (which it acquired from Data Junction) and Pervasive PSQL, its embedded database (a.k.a. Btrieve).

In 2009, Pervasive released a new product, called Pervasive DataRush, a parallel dataflow platform designed to accelerate performance for data preparation and analytics tasks. It fully leverages the parallel processing capabilities of multi-core processors and SMP machines, making it unnecessary to implement clusters (or MPP grids) to achieve suitable performance when processing and analyzing moderate to heavy volumes of data.

Sweet Spot. As a parallel data flow engine, Pervasive DataRush is often used today to power batch processing jobs, and is particularly well suited to running data preparation tasks (e.g. sorting, deduplicating, aggregating, cleansing, joining, loading, validating) and machine learning programs, such as fuzzy matching algorithms.

Today, DataRush will outperform Hadoop on complex processing jobs involving data volumes ranging from 500GB to tens of terabytes. It is not yet geared to handling hundreds of terabytes to petabytes of data, which is the territory of MPP systems and Hadoop. However, as chipmakers continue to add cores to their chips, and once Pervasive releases DataRush 5.0 later this year with support for small clusters, DataRush's high-end scalability will continue to increase.

Architecture. DataRush is not a database; it's a development environment and execution engine that runs in a Java Virtual Machine. Its Eclipse-based development environment provides a library of parallel operators for developers to create parallel dataflow programs. Although developers need to understand the basics of parallel operations--such as when it makes sense to partition data and/or processes based on the nature of their application-- DataRush handles all the underlying details of managing threads and processes across one or more cores to maximize utilization and performance. As you add cores, DataRush automatically readjusts the underlying parallelism without forcing the developer to recompile the application.
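DataRush's own operator library isn't reproduced here; as a rough conceptual stand-in, the following Java sketch hand-wires two hypothetical dataflow stages (parse, then filter) with queues, using a thread pool sized to the machine's cores as a stand-in for the engine's thread and process management:

```java
import java.util.List;
import java.util.concurrent.*;

public class DataflowSketch {
    public static void main(String[] args) throws Exception {
        BlockingQueue<String> raw = new LinkedBlockingQueue<>();
        BlockingQueue<Integer> parsed = new LinkedBlockingQueue<>();
        // The "engine" sizes its thread pool to the available cores.
        ExecutorService engine = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());

        // Stage 1: parse raw records (hypothetical format: one integer per record).
        engine.submit(() -> {
            String rec;
            while (!(rec = raw.take()).equals("EOF")) {
                parsed.put(Integer.parseInt(rec.trim()));
            }
            parsed.put(Integer.MIN_VALUE); // end-of-stream marker
            return null;
        });

        // Stage 2: filter and emit values above a threshold.
        engine.submit(() -> {
            int v;
            while ((v = parsed.take()) != Integer.MIN_VALUE) {
                if (v > 100) System.out.println("kept " + v);
            }
            return null;
        });

        // Feed a few sample records through the dataflow.
        for (String rec : List.of("42", "250", "7", "101", "EOF")) {
            raw.put(rec);
        }
        engine.shutdown();
        engine.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```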

Versus Hadoop. To run DataRush, you feed the execution engine formatted flat files or database records and it executes the various steps in the dataflow and spits out a data set. As such, it's more flexible than Hadoop, which requires data to be structured as key-value pairs and partitioned across servers, and MapReduce, which forces developers to use one type of programming model for executing programs. DataRush also doesn't have the overhead of Hadoop, which requires each data element to be duplicated in multiple nodes for failover purposes and requires lots of processing to support data movement and exchange across nodes. But like Hadoop, it's focused on running predefined programs in batch jobs, not ad hoc queries.

Competitors. Perhaps the closest competitors to Pervasive DataRush are Ab Initio, a parallelizable ETL tool, and Syncsort, a high-speed sorting engine. But these tools were developed before the advent of multi-core processing and don't exploit it to the same degree as DataRush. Plus, DataRush is not focused just on back-end processing, but can handle front-end analytic processing as well. Its data flow development environment and engine are generic. DataRush actually makes a good complement to MPP databases, which often suffer from a data loading bottleneck. When used as a transformation and loading engine, DataRush can achieve 2TB/hour throughput, according to company officials.

Despite all the current hype about MPP and scale-out architectures, it could be that scale-up architectures that fully exploit multi-core chips and SMP machines will win the race for mainstream analytics computing. Although you can't apply DataRush to existing analytic applications (you have to rewrite them), it will make a lot of sense to employ it for most new big data analytics applications.


Posted January 4, 2011 11:17 AM

I recently spoke with James Phillips, co-founder and senior vice president of products at Membase, an emerging NoSQL provider that powers many highly visible Web applications, such as Zynga's Farmville and AOL's ad targeting applications. James helped clarify for me the role of NoSQL in today's big data architectures.

Membase, like many of its NoSQL brethren, is an open source, key-value database. Membase was designed to run on clusters of commodity servers so it could "solve transaction problems at scale," says Phillips. Because of its transactional focus, Membase is not technology that I would normally talk about in the business intelligence (BI) sphere.

Same Challenges, Similar Solutions

However, today the transaction community is grappling with many of the same technical challenges as the BI community--namely, accessing and crunching large volumes of data in a fast, affordable way. Not coincidentally, the transactional community is coming up with many of the same solutions--namely, distributing data and processing across multiple nodes of commodity servers linked via high-speed interconnects. In other words, low-cost parallel processing.

Key-Value Pairs. But the NoSQL community differs in one major way from a majority of analytics vendors chasing large-scale parallel processing architectures: it relinquishes the relational framework in favor of key-value pair data structures. For data-intensive, Web-based applications that must dish up data to millions of concurrent online users in the blink of an eye, key-value pairs are a fast, flexible, and inexpensive approach. For example, you just pair a cookie with its ID, slam it into a file with millions of other key-value pairs, and distribute the files across multiple nodes in a cluster. A read works in reverse: the database finds the node with the right key-value pair to fulfill an application request and sends it along.

The beauty of NoSQL, according to Phillips, is that you don't have to put data into a table structure or use SQL to manipulate it. "With NoSQL, you put the data in first and then figure out how to manipulate it," Phillips says. "You can continue to change the kinds of data you store without having to change schemas or rebuild indexes and aggregates." Thus, the NoSQL mantra is "store first, design later." This makes NoSQL systems highly flexible but programmatically intensive, since you have to build programs to access the data. But since most NoSQL advocates are application developers (i.e. programmers), this model aligns with their strengths.
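As a rough illustration of that programmatic, schema-free access pattern, here's a short Java sketch using the open source spymemcached client against a memcached-compatible endpoint (Membase speaks the memcached protocol); the host, port, key, and value shown are placeholders, and Membase's bundled client libraries may differ:

```java
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class CookieStoreSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder host/port for a memcached-compatible Membase bucket.
        MemcachedClient client =
                new MemcachedClient(new InetSocketAddress("127.0.0.1", 11211));

        // Write: pair a cookie ID (the key) with an opaque value; no schema, no SQL.
        client.set("cookie:8f3a2c", 3600,
                "{\"segment\":\"gamer\",\"lastSeen\":1293772800}").get();

        // Read: the cluster locates the node that owns the key and returns the value.
        Object value = client.get("cookie:8f3a2c");
        System.out.println(value);

        client.shutdown();
    }
}
```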

In contrast, most analytics-oriented database vendors and SQL-oriented BI professionals haven't given up on the relational model, although they are pushing it to new heights to ensure adequate scalability and performance when processing large volumes of data. Relational database vendors are embracing techniques, such as columnar storage, storage-level intelligence, built-in analytics, hardware-software appliances, and, of course, parallel processing across clusters of commodity servers. BI professionals are purchasing these purpose-built analytical platforms to address performance and availability problems first and foremost and data scalability issues secondarily. And that's where Hadoop comes in.

Hadoop. Hadoop is an open source analytics framework for processing massive volumes of structured and unstructured data in a cost-effective manner. Like its NoSQL brethren, Hadoop abandons the relational model in favor of a file-based, programmatic approach based on Java. And like Membase, Hadoop uses a scale-out architecture that runs on commodity servers and requires no predefined schema or query language. Many Internet companies today use Hadoop to ingest and pre-process large volumes of clickstream data, which are then fed to a data warehouse for reporting and analysis. (However, many companies are also starting to run reports and queries directly against Hadoop.)
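To give a flavor of that pre-processing step, here's a minimal, hypothetical Hadoop MapReduce job (driver omitted) that rolls raw clickstream lines up into page-view counts before they're loaded into a warehouse; the tab-delimited log layout, with the URL in the third field, is an assumption for illustration:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PageViews {
    // Map: emit (pageUrl, 1) for every click record.
    public static class ClickMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t"); // assumed tab-delimited log
            if (fields.length > 2) {
                ctx.write(new Text(fields[2]), ONE);        // fields[2] assumed to be the URL
            }
        }
    }

    // Reduce: sum the counts for each page.
    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text page, Iterable<LongWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable c : counts) total += c.get();
            ctx.write(page, new LongWritable(total));
        }
    }
}
```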

Membase has a strong partnership with Cloudera, one of the leading distributors of open source Hadoop software. Membase wants to create bidirectional interfaces with Hadoop to easily move data between the two systems.

Membase Technology

Membase's secret sauce--the thing that differentiates it from its NoSQL competitors, such as Cassandra, MongoDB, CouchDB, and Redis--is that it incorporates Memcache, an open source caching technology. Memcache is used by many companies to provide reliable, ultra-fast performance for data-intensive Web applications that dish out data to millions of concurrent users. Today, many customers manually integrate Memcache with a relational database that persists the cached data to disk, preserving transactions or activity for future use.
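That manual integration is essentially the classic cache-aside pattern; the simplified Java sketch below (the JDBC connection, table, and key names are placeholders) shows the plumbing developers typically write themselves--and that Membase aims to take off their hands:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import net.spy.memcached.MemcachedClient;

public class CacheAsideSketch {
    public static String lookupProfile(MemcachedClient cache, Connection db, String userId)
            throws Exception {
        String key = "profile:" + userId;

        // 1. Try the cache first for sub-millisecond reads.
        Object cached = cache.get(key);
        if (cached != null) return (String) cached;

        // 2. On a miss, fall back to the relational store on disk.
        try (PreparedStatement ps =
                     db.prepareStatement("SELECT profile_json FROM profiles WHERE user_id = ?")) {
            ps.setString(1, userId);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) return null;
                String profile = rs.getString(1);

                // 3. Repopulate the cache so the next read is fast (1-hour TTL here).
                cache.set(key, 3600, profile);
                return profile;
            }
        }
    }
}
```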

Membase, on the other hand, does that integration upfront. It ties Memcache to a MySQL database which stores transactions to disk in a secure, reliable, and highly performant way. Membase then keeps the cache populated with working data that it pulls rapidly from disk in response to application requests. Because Membase distributes data across a cluster of commodity servers, it offers blazingly fast and reliable read/write performance required by the largest and most demanding Web applications.

Document Store. Membase will soon transform itself from a pure key-value database to a document store (a la MongoDB). This will give developers the ability to write functions that manipulate data inside data objects stored in predefined formats (e.g., JSON, Avro, or Protocol Buffers). Today, Membase can't "look inside" data objects to query, insert, or append information that the objects contain; it largely just dumps object values into an application.
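The practical difference is whether anything can reach inside the stored value; the short Java snippet below uses the Jackson library to pull a single field out of a JSON document that a pure key-value store would treat as an opaque blob (the document shape is invented for illustration):

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class DocumentPeek {
    public static void main(String[] args) throws Exception {
        // An opaque key-value store would hand back this whole string as-is.
        String doc = "{\"userId\":\"8f3a2c\",\"segment\":\"gamer\",\"score\":42}";

        // A document store can address individual fields inside the value.
        JsonNode parsed = new ObjectMapper().readTree(doc);
        System.out.println(parsed.get("segment").asText()); // prints "gamer"
    }
}
```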

Phillips said the purpose of the new document architecture is to support predefined queries within transactional applications. He made it clear that the goal isn't to support ad hoc queries or compete with analytics vendors: "Our customers aren't asking for ad hoc queries or analytics; they just want super-fast performance for pre-defined application queries."

Pricing. Customers can download a free community edition of Membase or purchase an annual subscription that provides support, packaging, and quality assurance testing. Pricing starts at $999 per node.


Posted December 23, 2010 9:38 AM