Blog: Rick van der Lans

Rick van der Lans

Welcome to my blog where I will talk about a variety of topics related to data warehousing, business intelligence, application integration, and database technology. Currently my special interests include data virtualization, NoSQL technology, and service-oriented architectures. If there are any topics you'd like me to address, send them to me at rick@r20.nl.

About the author

Rick is an independent consultant, speaker and author, specializing in data warehousing, business intelligence, database technology and data virtualization. He is managing director and founder of R20/Consultancy. An internationally acclaimed speaker who has lectured worldwide for the last 25 years, he is the chairman of the successful annual European Enterprise Data and Business Intelligence Conference held in London. In the summer of 2012 he published his new book Data Virtualization for Business Intelligence Systems. He is also the author of one of the most successful books on SQL, the popular Introduction to SQL, which is available in English, Chinese, Dutch, Italian and German. He has written many white papers for various software vendors. Rick can be contacted by sending an email to rick@r20.nl.

Editor's Note: Rick's blog and more articles can be accessed through his BeyeNETWORK Expert Channel.

Lately, many SQL-on-Hadoop query engines have been released, including Drill, HAWQ, IBM BigSQL, Impala, and Facebook's Presto. This new generation of SQL-on-Hadoop query engines can be compared with classic SQL database servers in many ways. For example, one can study the richness of the SQL dialect, the transaction mechanism, the quality of the query optimizer, or the query performance. The latter, in particular, is done quite regularly.

In the long run, what's more interesting is the independence between the SQL-on-Hadoop query engines on the one hand and the file systems and file storage formats on the other. This is a difference that many overlook.

Almost all classic SQL database servers come with their own file systems and file formats. That means that we can't create a database file with, for example, Oracle and later access that same file with DB2. To be able to do that, the data has to be exported from the Oracle database and then imported into the DB2 database. In these systems, the SQL query engine plus the file system plus the file format form one indivisible unit. In a way, it's a proprietary stack of software for data storage, data manipulation, and data access.
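As a rough illustration of that export/import detour, here is a minimal sketch; the table, column, and file names are made up, and the unload/reload steps are only indicated in comments, since the exact tooling differs per product:

    -- Step 1 (Oracle side): unload the rows to a neutral flat format such as CSV,
    -- because DB2 cannot read Oracle's own datafiles directly.
    SELECT customer_id, name, city
    FROM   customers;            -- spooled to customers.csv with a client tool

    -- Step 2 (DB2 side): recreate the table and reload the flat file.
    CREATE TABLE customers (
      customer_id INTEGER NOT NULL,
      name        VARCHAR(100),
      city        VARCHAR(50)
    );
    -- For example, with DB2's import facility:
    --   IMPORT FROM customers.csv OF DEL INSERT INTO customers

Only after this detour through a neutral format can DB2 query the data that originated in Oracle.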

One of the big advantages of most of the new SQL-on-Hadoop query engines is the independence between the three layers. Query engines, file systems and file formats are interchangeable. The consequence is that data can be inserted using, for example, Impala, and accessed afterwards using Drill. Or, a SQL query engine can seamlessly switch from the Apache HDFS file system to the MapR file system or to the Amazon S3 file system.
The big advantage is that this allows us to deploy different query engines, each with its own strengths and weaknesses, on the same data. There is no need to copy and duplicate the data. For example, for one group of users a SQL engine can be selected that is designed to support high-end, complex analytical queries, and for the other group an engine that's optimized for simpler interactive reporting.
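To make this concrete, here is a minimal sketch; the table definition, the column names, and the /data/sales path are invented for illustration. Data is written through Impala as Parquet files in a shared HDFS directory, and the very same files are then queried with Drill, without any export or import step:

    -- In Impala: define an external table over a shared directory and load it.
    -- The LOCATION could just as well be a MapR-FS or Amazon S3 path.
    CREATE EXTERNAL TABLE sales (
      sale_id BIGINT,
      region  STRING,
      amount  DOUBLE
    )
    STORED AS PARQUET
    LOCATION '/data/sales';

    INSERT INTO sales VALUES (1, 'EMEA', 250.00), (2, 'APAC', 120.00);

    -- In Drill: query the same Parquet files through its dfs storage plugin.
    -- No copy of the data is made; both engines share the files and the format.
    SELECT region, SUM(amount) AS total_amount
    FROM   dfs.`/data/sales`
    GROUP  BY region;

The only contract the two engines share is the directory layout and the Parquet file format; which engine wrote the files is irrelevant to the engine that reads them.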

It's comparable to having one flat screen that can be used as a TV, a computer monitor, and a projector screen, instead of having three separate screens.

This independence between the layers is a crucial difference between SQL-on-Hadoop query engines and classic SQL database servers. However, it's not yet getting the attention it deserves, and its advantages are quite often overlooked. It's a bigger advantage than most people think, especially in the world of big data, where duplicating complete files can be very time-consuming and expensive.


Posted January 27, 2014 1:18 AM

3 Comments

This is spot on. This flexibility - to use the same data in multiple ways, without moving it - is a core value proposition of the Apache Hadoop ecosystem. Even more interesting is the ability to go beyond SQL engines, to apply enterprise search, machine learning, stream processing, and so on. That's pretty important when you're talking about data volumes that simply aren't cost effective to replicate and move and remodel for every business process or question.

At Cloudera (where I work) we also like an analogy involving the iPhone (or other smartphone), as compared to an SLR camera. The SLR takes the best photos, but I'd bet that you take 95%+ of your photos with your smartphone for the convenience of multiple uses - you can take a photo, apply filters, and share it with friends and family, all from within one portable device; and it's also your GPS, phone, notepad, etc.

Increasingly enterprises are moving to a world where this flexibility is a key requirement for new data management infrastructure.

Great post on a very hot topic with a lot of confusion. It is still early days for SQL-on-Hadoop and giving customers flexibility for multiple techniques without lock-in is important as the market evolves.

MapR supports an "open SQL-on-Hadoop" approach, focusing on performance and reliability across ALL of these approaches. This gives customers more choice and faster interactive SQL as they move past experimentation with Hadoop and into serious production use where business SLAs matter.

It also reopens the classic debate between the optimization of the vendor's proprietary stack, where all layers are tuned for each other, and the openness (and freedom of choice) of best-of-breed, where the onus is on making the interfaces as seamless as possible.
