Strata+Hadoop World 2012, held in New York City several weeks ago, showcased dozens of big data technologies, demonstrating the breadth and depth of solutions that now run on or integrate with Hadoop. In a prior blog, I discussed the overarching trends fueling the big data movement and its implications for traditional data warehousing and business intelligence vendors and ecosystems. (See "TDWI 1997 Versus Hadoop World 2012".)
But in this blog, I want to discuss the numerous analytical, data integration, database, and infrastructure vendors I met with, most of whom were touting innovative products designed to meet emerging big data needs. Perhaps the two most interesting categories of tools I saw are those that provide real-time SQL queries against Hadoop and end-to-end analytical tools with embedded data processing capabilities.
Impala. One significant limitation of Hadoop is that it's a batch processing environment that can't support real-time queries. This makes it challenging for business analysts to explore Hadoop data efficiently without moving it into a relational database. To remedy this situation, several vendors announced or exhibited products that embed real-time query engines inside Hadoop. Chief among these is Cloudera, which announced a new Apache-licensed open source project called Impala, currently in beta, that embeds a real-time query engine alongside MapReduce. This should give some relief to Hadoop users who find Hive, which runs SQL-like queries against virtual tables in Hadoop, too slow and cumbersome for exploratory querying.
Impala, which Cloudera calls Real-Time Query for customers who pay for support, currently runs as a separate processing engine alongside MapReduce with its own parallel processing framework. According to CEO Mike Olson, Cloudera plans to port Impala to the next generation of Hadoop, called Yarn, which will provide native support for data processing engines other than MapReduce. Olson also said Cloudera will eventually upgrade Impala, which supports HiveQL, to support ANSI standard SQL. HiveQL is a SQL-like language that is missing many basic SQL functions, making it challenging for BI vendors to provide customers with rich SQL support for Hadoop data.
Hadapt. One vendor that beat Cloudera to the punch is Hadapt, which implements a Postgres database on every node in a Hadoop cluster. It converts ANSI-standard SQL queries into MapReduce jobs that parallelize the work across the cluster; each job then pushes its share of the work back down as SQL for processing in the local Postgres database. This lets BI users and BI tools submit rich SQL queries against structured data stored in Hadoop. To overcome the batch processing constraint, Hadapt just announced support for real-time queries that bypass the MapReduce layer in Hadoop, just like Cloudera's Impala. Hadapt also embeds the Solr search engine for processing text.
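To make the split-execution idea concrete, here is a minimal Python sketch of the general pattern: run a SQL fragment against each node's local database, then merge the partial results. It is not Hadapt's implementation. The in-memory SQLite databases stand in for the per-node Postgres instances, the plain Python loop stands in for MapReduce, and the `sales` table and all function names are hypothetical.

```python
import sqlite3

def make_node(rows):
    """Create one 'node': an in-memory SQLite database holding a shard of sales.
    (Assumption: SQLite stands in for the per-node Postgres database.)"""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn

def map_phase(node):
    """Run the per-node SQL fragment: a partial aggregate over the local shard."""
    return node.execute(
        "SELECT region, SUM(amount), COUNT(*) FROM sales GROUP BY region"
    ).fetchall()

def reduce_phase(partials):
    """Merge partial sums and counts from every node into final totals and averages."""
    totals = {}
    for rows in partials:
        for region, amount_sum, row_count in rows:
            running_sum, running_count = totals.get(region, (0.0, 0))
            totals[region] = (running_sum + amount_sum, running_count + row_count)
    return {
        region: {"total": s, "avg": s / c}
        for region, (s, c) in totals.items()
    }

def run_query(nodes):
    """Hypothetical driver: fan the fragment out to every node, then merge."""
    return reduce_phase([map_phase(node) for node in nodes])

nodes = [
    make_node([("east", 100.0), ("west", 50.0)]),
    make_node([("east", 300.0), ("west", 150.0), ("west", 100.0)]),
]
result = run_query(nodes)
print(result)
```

Note that each node returns a sum and a count rather than an average. Averages of averages would be wrong when shards hold different numbers of rows, so the merge step has to recompute the average from the combined partials.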
Others. It's more than likely that other Hadoop vendors, such as MapR and Hortonworks, as well as startups, will jump on the real-time query bandwagon in the near future. This trend will gain momentum when the Apache Hadoop Community elevates Yarn, otherwise known as Hadoop 2.0, from alpha into general release sometime in the next year or so. But it remains to be seen whether these products are more than just kludgy workarounds of Hadoop's batch processing environment.
End-to-End Analytical Toolsets
The current way to query Hadoop data in real time is to move the data out of Hadoop and into an analytical platform optimized for real-time query processing. This is what most BI and data warehousing vendors recommend. But the big data market has spawned a host of new analytical vendors hawking end-to-end BI capabilities. Here are a few I met with at Strata+Hadoop World 2012.
SiSense. SiSense provides a complete analytical processing environment that includes an in-memory columnar database, a visual data mashup tool, and visualization software based on the open source D3.js library, which builds on W3C web standards. SiSense's claim to fame is that it can process enormous volumes of structured data at an extremely low price. At Strata+Hadoop World 2012, the company showed how it could analyze 1TB of data on a $750 laptop with 8GB of RAM. The product's secret sauce is its ability to combine memory-based computing with a vectorwise, columnar database that places no limit on data volumes while processing most queries in memory.
Platfora. Startup Platfora also provides a complete analytical environment, but unlike SiSense, its visual ETL tools generate MapReduce jobs that load data into classic star schema models. These models run on an in-memory columnar database that can be incrementally updated as new data arrives in Hadoop. The MapReduce jobs can also parse and aggregate semi-structured data so that it can be analyzed in Platfora, which offers both SQL and REST interfaces and can run in the cloud or on premises, usually adjacent to an existing Hadoop cluster.
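The pattern described here (parse raw records, aggregate them, and merge each new batch into a star schema) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Platfora's code: SQLite stands in for the in-memory columnar database, an ordinary function stands in for the generated MapReduce jobs, and the log format, table names, and function names are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension table: one row per distinct page.
    CREATE TABLE dim_page (page_id INTEGER PRIMARY KEY, url TEXT UNIQUE);
    -- Fact table: one row per page per day, keyed to the dimension.
    CREATE TABLE fact_views (
        page_id INTEGER REFERENCES dim_page(page_id),
        day     TEXT,
        views   INTEGER,
        PRIMARY KEY (page_id, day)
    );
""")

def load_batch(conn, log_lines):
    """Parse raw 'day url' lines, aggregate them, and merge into the star schema."""
    counts = {}
    for line in log_lines:
        day, url = line.split()
        counts[(day, url)] = counts.get((day, url), 0) + 1

    for (day, url), n in counts.items():
        # Add the page to the dimension table only if it is new.
        conn.execute("INSERT OR IGNORE INTO dim_page (url) VALUES (?)", (url,))
        page_id = conn.execute(
            "SELECT page_id FROM dim_page WHERE url = ?", (url,)
        ).fetchone()[0]
        # Incremental update: bump an existing fact row, or insert a new one.
        cur = conn.execute(
            "UPDATE fact_views SET views = views + ? WHERE page_id = ? AND day = ?",
            (n, page_id, day),
        )
        if cur.rowcount == 0:
            conn.execute(
                "INSERT INTO fact_views VALUES (?, ?, ?)", (page_id, day, n)
            )

def views_by_url(conn):
    """Query the star schema: total views per page across all days."""
    return dict(conn.execute(
        "SELECT p.url, SUM(f.views) FROM fact_views f "
        "JOIN dim_page p USING (page_id) GROUP BY p.url"
    ).fetchall())

# Two batches arrive over time; the second updates rows created by the first.
load_batch(conn, ["2012-11-01 /home", "2012-11-01 /home", "2012-11-01 /pricing"])
load_batch(conn, ["2012-11-01 /home", "2012-11-02 /pricing"])
summary = views_by_url(conn)
fact_rows = conn.execute("SELECT COUNT(*) FROM fact_views").fetchone()[0]
print(summary, fact_rows)
```

Because each batch is merged rather than reloaded, the fact table stays small (one row per page per day) no matter how many raw log lines arrive.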
Alteryx. Alteryx recently expanded its focus from spatial analytics to an all-in-one analytical environment geared to business analysts. Alteryx Designer Desktop comes with a point-and-click development tool, personal ETL, industry content to enrich applications, data quality tools, R for predictive analytics, and a database. The product works on large volumes of data and can be deployed in the cloud as well.
Quest Kitenga. Recently purchased by Quest Software (which was recently purchased by Dell), Kitenga is a native Hadoop application that offers visual ETL, Solr-based search, natural language processing, Mahout-based data mining, and advanced visualization capabilities. It's a big data godsend for sophisticated analysts who want a robust toolbox of analytical tools.
Pentaho. Open source BI vendor Pentaho offers a complete set of big data tools that extract, transform, load, report, analyze, and explore data in Hadoop. Its new Instaview product enables business analysts to connect to any data source, including Hadoop, HBase, Cassandra, MongoDB, Web data (e.g., Twitter, Facebook, log files, and Web logs), and SQL sources; visually prepare data for analysis; and instantly visualize and explore data.
Other vendors I spoke with included analytics vendors (e.g., SAP, SAS, MetaMarkets, and Revolution Analytics), database vendors (Couchbase, Calpont, and Kognitio), and data fabric vendors (ScaleOut, Terracotta, and Continuity). And this was just the tip of the iceberg! I hope to be able to drill down on these and other vendors' offerings in the near future.
Posted November 8, 2012 3:00 PM