In the long run, what's more interesting is the independence between the SQL-on-Hadoop query engines on the one hand and the file systems and file storage formats on the other. This difference is overlooked by many.
Almost all classic SQL database servers come with their own file systems and file formats. That means we can't create a database file with, for example, Oracle, and then later access that same file with DB2. To do that, the data has to be exported from the Oracle database and then imported into the DB2 database. In these systems, the SQL query engine, the file system, and the file format form one indivisible unit: in effect, a proprietary stack of software for data storage, data manipulation, and data access.
One of the big advantages of most of the new SQL-on-Hadoop query engines is the independence between these three layers: query engines, file systems, and file formats are interchangeable. The consequence is that data can be inserted using, for example, Impala and accessed afterwards using Drill. Likewise, a SQL query engine can seamlessly switch from the Apache HDFS file system to the MapR file system or to Amazon S3.
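As a minimal sketch of what this interchangeability looks like in practice, consider the following example. The table name, schema, and path are illustrative, and the Drill query assumes a dfs storage plugin pointing at the same file system:

```sql
-- In Impala: define an external Parquet table over a directory of files.
-- The LOCATION could just as well be an s3a:// or maprfs:// path.
CREATE EXTERNAL TABLE sales (
  sale_id BIGINT,
  amount  DECIMAL(10,2),
  sale_ts TIMESTAMP
)
STORED AS PARQUET
LOCATION '/data/sales';

-- Insert data through Impala; it is written as Parquet files under /data/sales.
INSERT INTO sales VALUES (1, 19.99, NOW());

-- In Drill: read the very same Parquet files directly, with no export/import
-- step in between.
SELECT sale_id, amount
FROM dfs.`/data/sales`
WHERE amount > 10;
```

The point is not the specific syntax but that no data movement separates the two engines: both operate on the same Parquet files in place.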
The big advantage is that this allows us to deploy different query engines, each with its own strengths and weaknesses, on the same data. There is no need to copy or duplicate the data. For example, one group of users can be given a SQL engine designed to support high-end, complex analytical queries, while another group gets an engine optimized for simpler interactive reporting.
It's comparable to having one flat screen that can be used as a TV, a computer monitor, and a projector screen, instead of having three separate screens.
This independence between the layers is a crucial difference between SQL-on-Hadoop query engines and classic SQL database servers, yet it's not getting the attention it deserves, and its advantages are quite often overlooked. It's a bigger benefit than most people realize, especially in the world of big data, where duplicating complete files can be very time-consuming and expensive.
Posted January 27, 2014 1:18 AM
3 Comments
This is spot on. This flexibility - to use the same data in multiple ways, without moving it - is a core value proposition of the Apache Hadoop ecosystem. Even more interesting is the ability to go beyond SQL engines, to apply enterprise search, machine learning, stream processing, and so on. That's pretty important when you're talking about data volumes that simply aren't cost effective to replicate and move and remodel for every business process or question.
At Cloudera (where I work) we also like an analogy involving the iPhone (or other smartphone), as compared to an SLR camera. The SLR takes the best photos, but I'd bet that you take 95%+ of your photos with your smartphone for the convenience of multiple uses - you can take a photo, apply filters, and share it with friends and family, all from within one portable device; and it's also your GPS, phone, notepad, etc.
Increasingly enterprises are moving to a world where this flexibility is a key requirement for new data management infrastructure.
Great post on a very hot topic surrounded by a lot of confusion. It is still early days for SQL-on-Hadoop, and giving customers the flexibility to use multiple techniques without lock-in is important as the market evolves.
MapR supports an "open SQL-on-Hadoop" approach, focusing on performance and reliability across ALL of these approaches. This gives customers more choice and faster interactive SQL as they move past experimentation with Hadoop and into serious production use, where business SLAs matter.
It also reopens the classic debate between the optimization of a vendor's proprietary stack, where all layers are tuned for each other, and the openness (and freedom of choice) of best-of-breed, where the onus is on making the interfaces as seamless as possible.