I recently spoke with James Phillips, co-founder and senior vice president of products at Membase, an emerging NoSQL provider that powers many highly visible Web applications, such as Zynga's FarmVille and AOL's ad targeting applications. James helped clarify for me the role of NoSQL in today's big data architectures.
Membase, like many of its NoSQL brethren, is an open source, key-value database. Membase was designed to run on clusters of commodity servers so it could "solve transaction problems at scale," says Phillips. Because of its transactional focus, Membase is not a technology that I would normally talk about in the business intelligence (BI) sphere.
Same Challenges, Similar Solutions
However, today the transactional community is grappling with many of the same technical challenges as the BI community--namely, accessing and crunching large volumes of data in a fast, affordable way. Not coincidentally, the transactional community is coming up with many of the same solutions--namely, distributing data and processing across multiple nodes of commodity servers linked via high-speed interconnects. In other words, low-cost parallel processing.
Key-Value Pairs. But the NoSQL community differs in one major way from a majority of analytics vendors chasing large-scale parallel processing architectures: it relinquishes the relational framework in favor of key-value pair data structures. For data-intensive, Web-based applications that must dish up data to millions of concurrent online users in the blink of an eye, key-value pairs are a fast, flexible, and inexpensive approach. For example, you just pair a cookie with its ID, slam it into a file with millions of other key-value pairs, and distribute the files across multiple nodes in a cluster. A read works in reverse: the database finds the node with the right key-value pair to fulfill an application request and sends it along.
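The write and read paths described above can be sketched in a few lines of Python. This is only an illustration of the general idea of spreading key-value pairs across nodes by hashing the key; it is not a description of how Membase itself places data. The class name, node names, and the cookie key are all made up for the example, and each "node" is just an in-memory dictionary.

```python
import hashlib


class TinyCluster:
    """Toy stand-in for a cluster; each 'node' is a plain dict (assumption)."""

    def __init__(self, node_names):
        self.node_names = list(node_names)
        self.nodes = {name: {} for name in self.node_names}

    def node_for(self, key):
        # Hash the key and map it onto one node with simple modulo placement.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.node_names[int(digest, 16) % len(self.node_names)]

    def put(self, key, value):
        # Write: store the pair on whichever node owns the key.
        self.nodes[self.node_for(key)][key] = value

    def get(self, key):
        # Read: the same hash finds the node that holds the pair.
        return self.nodes[self.node_for(key)].get(key)


cluster = TinyCluster(["node-a", "node-b", "node-c"])
cluster.put("cookie:8f3a", "user-1001")
print(cluster.get("cookie:8f3a"))
```

Because the same hash function is used for reads and writes, no central index is needed to find a pair. Real systems use more elaborate placement schemes so that adding or removing a node does not reshuffle most of the keys.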
The beauty of NoSQL, according to Phillips, is that you don't have to put data into a table structure or use SQL to manipulate it. "With NoSQL, you put the data in first and then figure out how to manipulate it," Phillips says. "You can continue to change the kinds of data you store without having to change schemas or rebuild indexes and aggregates." Thus, the NoSQL mantra is "store first, design later." This makes NoSQL systems highly flexible but programmatically intensive since you have to build programs to access the data. But since most NoSQL advocates are application developers (i.e., programmers), this model aligns with their strengths.
In contrast, most analytics-oriented database vendors and SQL-oriented BI professionals haven't given up on the relational model, although they are pushing it to new heights to ensure adequate scalability and performance when processing large volumes of data. Relational database vendors are embracing techniques such as columnar storage, storage-level intelligence, built-in analytics, hardware-software appliances, and, of course, parallel processing across clusters of commodity servers. BI professionals are purchasing these purpose-built analytical platforms to address performance and availability problems first and foremost and data scalability issues secondarily. And that's where Hadoop comes in.
Hadoop. Hadoop is an open source analytics architecture for processing massively large volumes of structured and unstructured data in a cost-effective manner. Like its NoSQL brethren, Hadoop abandons the relational model in favor of a file-based, programmatic approach based on Java. And like Membase, Hadoop uses a scale-out architecture that runs on commodity servers and requires no predefined schema or query language. Many Internet companies today use Hadoop to ingest and pre-process large volumes of clickstream data, which is then fed to a data warehouse for reporting and analysis. (However, many companies are also starting to run reports and queries directly against Hadoop.)
Membase has a strong partnership with Cloudera, one of the leading distributors of open source Hadoop software. Membase wants to create bidirectional interfaces with Hadoop to easily move data between the two systems.
Membase's secret sauce--the thing that differentiates it from its NoSQL competitors, such as Cassandra, MongoDB, CouchDB, and Redis--is that it incorporates Memcache, an open source caching technology. Memcache is used by many companies to provide reliable, ultra-fast performance for data-intensive Web applications that dish out data to millions of concurrent customers. Today, many customers manually integrate Memcache with a relational database that persists the cached data to disk, so that transactions or activity are kept for future use.
Membase, on the other hand, does that integration upfront. It ties Memcache to a MySQL database, which stores transactions to disk in a secure, reliable, and highly performant way. Membase then keeps the cache populated with working data that it pulls rapidly from disk in response to application requests. Because Membase distributes data across a cluster of commodity servers, it offers the blazingly fast and reliable read/write performance required by the largest and most demanding Web applications.
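To make the cache-plus-disk arrangement concrete, here is a minimal Python sketch of a store that writes to both tiers and serves reads from memory when it can. It is a toy under stated assumptions, not Membase's implementation: the class and method names are invented, and two dictionaries stand in for the in-memory cache and the on-disk database.

```python
class CachedStore:
    """Toy two-tier store: a dict as the cache, a dict as 'disk' (assumption)."""

    def __init__(self):
        self.cache = {}      # stands in for the in-memory cache tier
        self.disk = {}       # stands in for the persistent database
        self.disk_reads = 0  # counts how often a read had to go to 'disk'

    def set(self, key, value):
        # Persist the value and keep it hot in the cache.
        self.disk[key] = value
        self.cache[key] = value

    def get(self, key):
        # Serve from memory when possible.
        if key in self.cache:
            return self.cache[key]
        # Otherwise fall back to disk and repopulate the cache.
        self.disk_reads += 1
        value = self.disk.get(key)
        if value is not None:
            self.cache[key] = value
        return value

    def evict(self, key):
        # Simulate the cache dropping an entry under memory pressure.
        self.cache.pop(key, None)


store = CachedStore()
store.set("session:42", "cart=3")
store.evict("session:42")
first = store.get("session:42")   # goes to 'disk', then refills the cache
second = store.get("session:42")  # served from the cache
print(first, second, store.disk_reads)
```

When customers wire a cache to a database by hand, logic like this lives in every application. Bundling it into the data store means the application only ever calls set and get.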
Document Store. Membase will soon transform itself from a pure key-value database to a document store (a la MongoDB). This will give developers the ability to write functions that manipulate data inside data objects stored in predefined formats (e.g., JSON, Avro, or Protocol Buffers). Today, Membase can't "look inside" data objects to query, insert, or append information that the objects contain; it largely just dumps object values into an application.
Phillips said the purpose of the new document architecture is to support predefined queries within transactional applications. He made it clear that the goal isn't to support ad hoc queries or compete with analytics vendors: "Our customers aren't asking for ad hoc queries or analytics; they just want super-fast performance for pre-defined application queries."
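The difference between an opaque key-value store and a document store can be illustrated with a short Python sketch. Everything here is hypothetical: the keys, the field names, and both helper functions are invented for the example and do not reflect Membase's planned API. A dictionary of JSON strings stands in for the database.

```python
import json

# Each value is an opaque JSON string as far as a pure key-value store knows.
store = {
    "player:7": json.dumps({"name": "Ana", "level": 12, "crops": ["corn"]}),
    "player:8": json.dumps({"name": "Ben", "level": 3, "crops": []}),
}


def append_crop_client_side(store, key, crop):
    # Key-value style: the application fetches the whole value, decodes it,
    # changes it, and writes the whole value back.
    doc = json.loads(store[key])
    doc["crops"].append(crop)
    store[key] = json.dumps(doc)


def find_keys_where(store, field, predicate):
    # Document style: a predefined function that looks inside each object
    # and returns the keys whose field satisfies the predicate.
    matches = []
    for key, raw in store.items():
        doc = json.loads(raw)
        if field in doc and predicate(doc[field]):
            matches.append(key)
    return matches


append_crop_client_side(store, "player:7", "wheat")
high_level = find_keys_where(store, "level", lambda v: v >= 10)
print(high_level)
```

In a document store, a function like find_keys_where would run inside the database against a known format, which suits the predefined application queries Phillips describes rather than ad hoc analysis.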
Pricing. Customers can download a free community edition of Membase or purchase an annual subscription that provides support, packaging, and quality assurance testing. Pricing starts at $999 per node.
Posted December 23, 2010 9:38 AM