


Rick van der Lans

Welcome to my blog where I will talk about a variety of topics related to data warehousing, business intelligence, application integration, and database technology. Currently my special interests include data virtualization, NoSQL technology, and service-oriented architectures. If there are any topics you'd like me to address, send them to me at rick@r20.nl.

About the author

Rick is an independent consultant, speaker and author, specializing in data warehousing, business intelligence, database technology and data virtualization. He is managing director and founder of R20/Consultancy. An internationally acclaimed speaker who has lectured worldwide for the last 25 years, he is the chairman of the successful European Enterprise Data and Business Intelligence Conference, held annually in London. In the summer of 2012 he published his new book Data Virtualization for Business Intelligence Systems. He is also the author of one of the most successful books on SQL, the popular Introduction to SQL, which is available in English, Chinese, Dutch, Italian and German. He has written many white papers for various software vendors. Rick can be contacted by sending an email to rick@r20.nl.

Editor's Note: Rick's blog and more articles can be accessed through his BeyeNETWORK Expert Channel.

January 2014 Archives

Lately, many SQL-on-Hadoop query engines have been released, including Drill, Hawq, IBM BigSQL, Impala, and Facebook Presto. This new generation of SQL-on-Hadoop query engines can be compared with classic SQL database servers in many ways. For example, one can study the richness of the SQL dialect, the transaction mechanism, the quality of the query optimizer, or the query performance. The latter, especially, is done quite regularly.

In the long run, what's more interesting is the independence between the SQL-on-Hadoop query engines on the one hand, and the file systems and file storage formats on the other. This is a difference that many overlook.

Almost all classic SQL database servers come with their own file systems and file formats. That means that we can't develop a database file using, for example, Oracle, and then later on access that same file using DB2. To be able to do that, the data has to be exported from the Oracle database and then imported into the DB2 database. In these systems, the SQL query engine plus the file system plus the file format form one indivisible unit. In a way, it's a proprietary stack of software for data storage, data manipulation, and data access.

One of the big advantages of most of the new SQL-on-Hadoop query engines is the independence between the three layers. Query engines, file systems and file formats are interchangeable. The consequence is that data can be inserted using, for example, Impala, and accessed afterwards using Drill. Or, a SQL query engine can seamlessly switch from the Apache HDFS file system to the MapR file system or to the Amazon S3 file system.
The big advantage is that this allows us to deploy different query engines, each with its own strengths and weaknesses, on the same data. There is no need to copy or duplicate the data. For example, for one group of users a SQL engine can be selected that is designed to support high-end, complex analytical queries, and for another group an engine that's optimized for simpler, interactive reporting.
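
As a rough sketch of what that looks like in practice (the table name, columns, and HDFS path below are made up for illustration), an existing Parquet file can be registered as an external table and then queried by whichever engine suits the workload, without moving the data:

    -- Register existing Parquet files as an external table (hypothetical example).
    CREATE EXTERNAL TABLE sales (
      order_id  BIGINT,
      customer  STRING,
      amount    DECIMAL(10,2)
    )
    STORED AS PARQUET
    LOCATION 'hdfs:///data/sales/';

    -- The same Parquet files can afterwards be queried by another engine,
    -- for example for an analytical workload:
    SELECT customer, SUM(amount) AS total_amount
    FROM sales
    GROUP BY customer;

Because the table definition only points at the files, defining the same table again in another engine, or dropping it, leaves the underlying data untouched.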

It's comparable to having one flat screen that can be used as a TV, a computer monitor, and a projection screen, instead of having three separate screens.

This independence between the layers is a crucial difference between SQL-on-Hadoop query engines and classic SQL database servers. However, it's not yet getting the attention it deserves, and its advantages are quite often overlooked. It's a bigger advantage than most people think, especially in the world of big data, where duplicating complete files can be very time-consuming and expensive.


Posted January 27, 2014 1:18 AM
A message to mapmakers: highways are not painted red, rivers don't have county lines running down the middle, and you don't see contour lines on a mountain. This is how William Kent starts his book "Data and Reality". The first edition was published in 1978, a long, long time ago. It was the time when people like Chris Date, Ted Codd, and Ron Fagin were coming up with normalization and the accompanying normal forms. It was prime time for relational theory.

I bought the book in 1981. The edition I have is old. It uses a non-proportional font, and you can clearly see that the typesetting was done with Word, which at that time was the ultimate word processing software tool.

I still consider this book to be the best book on data modeling. Many other great books have been published, but this is still my favorite. One of the reasons I like it is that it addresses very fundamental data modeling questions, such as "Is an object still the same object when all its parts have been replaced?" and "What is really the difference between attributes and relationships?"

Maybe you have never heard of this book or of William Kent, but you may well know his simple way to remember normalization: "[Every] non-key [attribute] must provide a fact about the key, the whole key, and nothing but the key."
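
To make that mnemonic concrete, here is a small, made-up example: in an order-lines table with the composite key (order_id, product_id), a column such as product_name is a fact about only part of the key, so it doesn't belong there:

    -- product_name depends only on product_id, not on the whole key,
    -- so it violates "the key, the whole key, and nothing but the key".
    CREATE TABLE order_lines (
      order_id     INT,
      product_id   INT,
      product_name VARCHAR(50),   -- fact about part of the key only
      quantity     INT,           -- fact about the whole key
      PRIMARY KEY (order_id, product_id)
    );

    -- Normalized: product_name moves to a table keyed on product_id alone.
    CREATE TABLE products (
      product_id   INT PRIMARY KEY,
      product_name VARCHAR(50)
    );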

I have always recommended this book to the attendees of my data modeling sessions, even though I knew the book was hard to get. The good news is that a new edition has been republished. It includes comments by Steve Hoberman and a note by Chris Date. If you haven't read this book yet, this is your chance. A must-read if you're into data modeling.


Posted January 8, 2014 1:21 AM