This year at the
Strata Conference in Santa Clara, CA it was very clear: The Battle of the SQL-on-Hadoop engines is underway. Many existing and new vendors presented their solutions at the exhibition and many sessions were dedicated to this topic.
As popular as NoSQL was a year ago, so popular is SQL-on-Hadoop today. Here are some of the many implementations: Apache Hive, CitusDB, Cloudera Impala, Concurrency Lingual, Hadapt, InfiniDB, JethroData, MammothDB, MapR Drill, MemSQL, Pivotal HawQ, Progress DataDirect, ScleraDB, Simba, and SpliceMachine.
Besides these implementations, we should also include all the data virtualization products that are designed to access all kinds of data sources including Hadoop and to integrate data from different data sources. Examples are Cirro, Cisco/Composite, Denodo, Informatica IDS, RedHat JBoss Data Virtualization, and Stonebond.
And, of course, we have a few SQL database servers that support polyglot persistence. This means that they can store their data in their own native SQL database or in Hahoop. Examples are EMC/Greenplum UAP, HP Vertica (on MapR), Microsoft Polybase, Actian Paraccell, and Teradata Aster database (SQL-H).
Most of these implementations are currently restricted to query the data stored in Hadoop, but some, such as SpliceMachine, support transactions on Hadoop. Most of them don't work with indexes, although JethroData does.
This attention for SQL-on-Hadoop makes a lot of sense. By making all the big data stored in HDFS available through a SQL interface makes it accessible for numerous reporting and analytical tools. It makes big data available for the masses. It's not only for the happy few anymore who are good at programming Java.
If you're interested in SQL-on-Hadoop, you have to study at least two technical aspects. First, how efficient are these engines when executing joins? Especially joining multiple big tables is a technological challenge. Second, running one query fast is relatively easy, but how well do these engines manage their workload if multiple queries with different characteristics have to be executed concurrently? In other words, how well does the engine manage the query workload? Can one resource-hungry query consume all the available resources, making all the other queries wait? So, don't be influenced too much by single-user benchmarks.
It's easy to predict that we will see many more of these SQL-on-Hadoop implementations entering this market. That the existing products will improve and become faster is evident. The big question is which of them will survive this battle? That not all of them will be commercially successful is evident, but for customers it's important that a selected product still exists after a few years. This is hard to predict today, because the market is still rapidly evolving. Let's see what the status is of this large group of products next year at Strata.