We use cookies and other similar technologies (Cookies) to enhance your experience and to provide you with relevant content and ads. By using our website, you are agreeing to the use of Cookies. You can change your settings at any time. Cookie Policy.

Blog: Rick van der Lans Subscribe to this blog's RSS feed!

Rick van der Lans

Welcome to my blog where I will talk about a variety of topics related to data warehousing, business intelligence, application integration, and database technology. Currently my special interests include data virtualization, NoSQL technology, and service-oriented architectures. If there are any topics you'd like me to address, send them to me at rick@r20.nl.

About the author >

Rick is an independent consultant, speaker and author, specializing in data warehousing, business intelligence, database technology and data virtualization. He is managing director and founder of R20/Consultancy. An internationally acclaimed speaker who has lectured worldwide for the last 25 years, he is the chairman of the successful annual European Enterprise Data and Business Intelligence Conference held annually in London. In the summer of 2012 he published his new book Data Virtualization for Business Intelligence Systems. He is also the author of one of the most successful books on SQL, the popular Introduction to SQL, which is available in English, Chinese, Dutch, Italian and German. He has written many white papers for various software vendors. Rick can be contacted by sending an email to rick@r20.nl.

Editor's Note: Rick's blog and more articles can be accessed through his BeyeNETWORK Expert Channel.

This year at the Strata Conference in Santa Clara, CA it was very clear: The Battle of the SQL-on-Hadoop engines is underway. Many existing and new vendors presented their solutions at the exhibition and many sessions were dedicated to this topic.

As popular as NoSQL was a year ago, so popular is SQL-on-Hadoop today. Here are some of the many implementations: Apache Hive, CitusDB, Cloudera Impala, Concurrency Lingual, Hadapt, InfiniDB, JethroData, MammothDB, MapR Drill, MemSQL, Pivotal HawQ, Progress DataDirect, ScleraDB, Simba, and SpliceMachine.

Besides these implementations, we should also include all the data virtualization products that are designed to access all kinds of data sources including Hadoop and to integrate data from different data sources. Examples are Cirro, Cisco/Composite, Denodo, Informatica IDS, RedHat JBoss Data Virtualization, and Stonebond.

And, of course, we have a few SQL database servers that support polyglot persistence. This means that they can store their data in their own native SQL database or in Hahoop. Examples are EMC/Greenplum UAP, HP Vertica (on MapR), Microsoft Polybase, Actian Paraccell, and Teradata Aster database (SQL-H).

Most of these implementations are currently restricted to query the data stored in Hadoop, but some, such as SpliceMachine, support transactions on Hadoop. Most of them don't work with indexes, although JethroData does.

This attention for SQL-on-Hadoop makes a lot of sense. By making all the big data stored in HDFS available through a SQL interface makes it accessible for numerous reporting and analytical tools. It makes big data available for the masses. It's not only for the happy few anymore who are good at programming Java.

If you're interested in SQL-on-Hadoop, you have to study at least two technical aspects. First, how efficient are these engines when executing joins?  Especially joining multiple big tables is a technological challenge. Second, running one query fast is relatively easy, but how well do these engines manage their workload if multiple queries with different characteristics have to be executed concurrently? In other words, how well does the engine manage the query workload? Can one resource-hungry query consume all the available resources, making all the other queries wait? So, don't be influenced too much by single-user benchmarks.

It's easy to predict that we will see many more of these SQL-on-Hadoop implementations entering this market. That the existing products will improve and become faster is evident. The big question is which of them will survive this battle? That not all of them will be commercially successful is evident, but for customers it's important that a selected product still exists after a few years. This is hard to predict today, because the market is still rapidly evolving. Let's see what the status is of this large group of products next year at Strata.

Posted February 20, 2014 1:46 AM
Permalink | No Comments |

Leave a comment