In the long run, what's more interesting is the independence between, on one hand, the SQL-on-Hadoop query engines and, on the other hand, the file systems and file storage formats. This is a difference overlooked by many.
Almost all classic SQL database servers come with their own file systems and file formats. That means that we can't develop a database file using, for example, Oracle, and then later on access that same file using DB2. To be able to do that, the data has to exported from the Oracle database and imported in the DB2 database next. In these systems, the SQL query engine plus the file system plus the file format form one indivisible unit. In a way, it's a proprietary stack of software for data storage, data manipulation, and data access.
One of the big advantages of most of the new SQL-on-Hadoop query engines is the independence between the three layers. Query engines, file systems and file formats are interchangeable. The consequence is that data can be inserted using, for example, Impala, and accessed afterwards using Drill. Or, a SQL query engine can seamlessly switch from the Apache HDFS file system to the MapR file system or to the Amazon S3 file system.
The big advantage is that this allows us to deploy different query engines, each with its own strengths and weaknesses, on the same data. There is no need to copy and duplicate the data. For example, for one group of users a SQL engine can be selected that is designed to support high-end, complex analytical queries, and for the other group an engine that's optimized for more simple interactive reporting.
It's comparable to having one flat screen that can be used as TV, computer screen, and as projector screen, instead of having three separate screens.
This independence between the layers is a crucial difference between SQL-on-Hadoop query engines and classic SQL database servers. However, it's not getting the attention it deserves yet. The advantages of it are quite often overlooked. It's a bigger advantage than most people think, especially in the world of big data, where duplicating complete files can be very time-consuming and expensive.
Posted January 27, 2014 1:18 AM
Permalink | 3 Comments |