
Blog: Rick van der Lans

Rick van der Lans

Welcome to my blog where I will talk about a variety of topics related to data warehousing, business intelligence, application integration, and database technology. Currently my special interests include data virtualization, NoSQL technology, and service-oriented architectures. If there are any topics you'd like me to address, send them to me at rick@r20.nl.

About the author

Rick is an independent consultant, speaker and author, specializing in data warehousing, business intelligence, database technology and data virtualization. He is managing director and founder of R20/Consultancy. An internationally acclaimed speaker who has lectured worldwide for the last 25 years, he is the chairman of the successful European Enterprise Data and Business Intelligence Conference, held annually in London. In the summer of 2012 he published his new book Data Virtualization for Business Intelligence Systems. He is also the author of one of the most successful books on SQL, the popular Introduction to SQL, which is available in English, Chinese, Dutch, Italian and German. He has written many white papers for various software vendors. Rick can be contacted by sending an email to rick@r20.nl.

Editor's Note: Rick's blog and more articles can be accessed through his BeyeNETWORK Expert Channel.

"Prediction is very difficult, especially if it's about the future." These were the famous words of Niels Bohr, Nobel laureate in Physics. I discovered how right he was when organizing a large pile of old IT magazines in my office. I found a magazine published by Fawcette Technical Publications. It was a special issue to celebrate their 10th anniversary, and was published in the winter of 2000/2001. The issue was called "The Future of Software" and contained more than 30 short articles written by a long list of experts.

Some of the authors were dead right with their predictions:

  • Avi Silberschatz: Accessing continuously growing databases with zero-latency requires new solutions, such as main-memory database systems.
  • Richard Schwartz: The future is for the 10-second application; the challenge is to develop applications requiring an absolute minimal level of keystrokes ...
  • Kent Beck: In the next 10 years ... we jump to a world where we see software development as a conversation between business and technology.

And some were less lucky (I will leave it to the readers to determine which ones were correct):

  • Eric Bonabeau: Swarm intelligence will become a new model for computing.
  • Mitchell Kertzman: The future of software is to be platform-independent, not just OS-independent, but device-independent, network-transport independent ...
  • Thomas Vayda: By 2003, 70% of all new applications will be built primarily from "building blocks" ...

Interestingly enough, the big changes in the IT industry of the last few years were not mentioned at all, such as:

  • Big data
  • Smartphones
  • Social media networks

This shows how hard it really is to predict the future of the IT industry. Maybe it's better to listen to Sir Winston Churchill: "I always avoid prophesying beforehand because it is much better to prophesy after the event has already taken place."

Posted March 2, 2014 11:29 AM
If one thing became clear to me at the Strata Conference this month, it was that the popularity of Hadoop is unmistakable and that SQL-on-Hadoop follows closely in its footsteps. A SQL-on-Hadoop engine makes it possible to access big data, stored in Hadoop HDFS or HBase, using the language so familiar to many developers, namely SQL. SQL-on-Hadoop also makes it easier for popular reporting and analytical tools to access big data in Hadoop.

Data virtualization servers have been offering SQL access to non-SQL data sources for a long time. Most of them allow SQL access to data stored in spreadsheets, XML documents, sequential files, pre-relational database servers, data hidden behind APIs such as SOAP and REST, and data stored in applications such as SAP and Salesforce.com.

Most of the current SQL-on-Hadoop engines offer only SQL query access to one or two data sources: HDFS and HBase. Sounds easy, but it's not. The technical problem they have to solve is how to turn all the non-relational data stored in Hadoop, such as variable data, self-describing data, and schema-less data, into flat relational structures.
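To make the flattening problem concrete, here is a minimal Python sketch. It is only an illustration, not how any particular SQL-on-Hadoop engine works: the sample records, the dotted column names, and the helper names `flatten` and `to_table` are all assumptions made for this example. Nested objects become extra columns, and fields missing from a record become NULLs (`None`). Arrays are deliberately left out, because they require a design decision (extra rows or a separate table) that differs per engine.

```python
import json

def flatten(record, prefix=""):
    """Turn a nested dict into a single-level dict with dotted column names."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested object: recurse and prefix the child column names.
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat

def to_table(json_lines):
    """Derive one flat schema from self-describing records and fill in the rows."""
    rows = [flatten(json.loads(line)) for line in json_lines]
    # The schema is the union of all columns seen in any record.
    columns = sorted({column for row in rows for column in row})
    # Fields that a record does not have become None (NULL).
    table = [tuple(row.get(column) for column in columns) for row in rows]
    return columns, table

# Hypothetical sample data: two records with different fields.
SAMPLE = [
    '{"id": 1, "name": "Ann", "address": {"city": "London"}}',
    '{"id": 2, "name": "Bob", "phone": "555-0100"}',
]

columns, table = to_table(SAMPLE)
print(columns)
for row in table:
    print(row)
```

Even in this toy version the schema is only known after all records have been read, which hints at why the problem is harder than it sounds.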

However, the question is whether offering query capabilities on Hadoop is sufficient, because the bar is being raised for SQL-on-Hadoop engines. Some, such as SpliceMachine, offer transactional support on Hadoop in addition to the queries. Others, such as Cirro and ScleraDB, support data federation: data stored in SQL databases can be joined with Hadoop data. So, maybe offering SQL query capabilities on Hadoop will not be enough anymore in the near future.
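What data federation means in practice can be sketched in a few lines of Python. This is a toy under stated assumptions: an in-memory SQLite database stands in for the SQL database, a list of JSON lines stands in for a file in Hadoop, and the names `customers`, `CLICK_LOG`, and `federated_join` are made up for the example. It does not reflect how Cirro or ScleraDB implement federation.

```python
import json
import sqlite3

# Source 1: a SQL database (in-memory SQLite as a stand-in).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
db.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])

# Source 2: JSON lines, standing in for a file stored in Hadoop.
CLICK_LOG = [
    '{"customer_id": 1, "page": "/home"}',
    '{"customer_id": 1, "page": "/pricing"}',
    '{"customer_id": 3, "page": "/home"}',
]

def federated_join(connection, log_lines):
    """Hash join: build a lookup from the SQL side, probe it with each log record."""
    names = dict(connection.execute("SELECT id, name FROM customers"))
    result = []
    for line in log_lines:
        event = json.loads(line)
        name = names.get(event["customer_id"])
        if name is not None:  # inner join: skip events without a matching customer
            result.append((name, event["page"]))
    return result

joined = federated_join(db, CLICK_LOG)
print(joined)
```

The sketch pulls the complete customers table to the join. With big tables on both sides, deciding which data to move where is exactly the kind of optimization a real federation engine has to take care of.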

Data virtualization servers have started to offer access to Hadoop as well, and with that they have entered the market of SQL-on-Hadoop engines. In doing so, they raise the bar for SQL-on-Hadoop engines even more. Current data virtualization servers are not simply runtime engines that offer SQL access to various data sources. Most of them also offer data federation capabilities for many non-SQL data sources, a high-level design and modeling environment with lineage and impact analysis features, caching capabilities to minimize access to the data source, distributed join optimization techniques, and data security features.

In the near future, SQL-on-Hadoop engines are expected to be extended with these typical data virtualization features. And data virtualization servers will have to enrich themselves with full-blown support for Hadoop. But whatever happens, the two markets will slowly converge into one. Some products will merge and others will be extended. This is definitely a market to keep an eye on in the coming years.

Posted February 24, 2014 3:39 AM
This year at the Strata Conference in Santa Clara, CA it was very clear: The Battle of the SQL-on-Hadoop engines is underway. Many existing and new vendors presented their solutions at the exhibition and many sessions were dedicated to this topic.

SQL-on-Hadoop is as popular today as NoSQL was a year ago. Here are some of the many implementations: Apache Hive, CitusDB, Cloudera Impala, Concurrent Lingual, Hadapt, InfiniDB, JethroData, MammothDB, MapR Drill, MemSQL, Pivotal HawQ, Progress DataDirect, ScleraDB, Simba, and SpliceMachine.

Besides these implementations, we should also include all the data virtualization products that are designed to access all kinds of data sources including Hadoop and to integrate data from different data sources. Examples are Cirro, Cisco/Composite, Denodo, Informatica IDS, RedHat JBoss Data Virtualization, and Stonebond.

And, of course, we have a few SQL database servers that support polyglot persistence. This means that they can store their data in their own native SQL database or in Hadoop. Examples are EMC/Greenplum UAP, HP Vertica (on MapR), Microsoft Polybase, Actian ParAccel, and Teradata Aster database (SQL-H).

Most of these implementations are currently restricted to querying the data stored in Hadoop, but some, such as SpliceMachine, support transactions on Hadoop. Most of them don't work with indexes, although JethroData does.

This attention to SQL-on-Hadoop makes a lot of sense. Making all the big data stored in HDFS available through a SQL interface makes it accessible to numerous reporting and analytical tools. It makes big data available to the masses. It's no longer only for the happy few who are good at programming in Java.

If you're interested in SQL-on-Hadoop, you have to study at least two technical aspects. First, how efficient are these engines when executing joins? Joining multiple big tables in particular is a technological challenge. Second, running one query fast is relatively easy, but how well do these engines manage their workload if multiple queries with different characteristics have to be executed concurrently? Can one resource-hungry query consume all the available resources, making all the other queries wait? So, don't be influenced too much by single-user benchmarks.

It's easy to predict that we will see many more SQL-on-Hadoop implementations entering this market. It is equally evident that the existing products will improve and become faster. The big question is which of them will survive this battle. Not all of them will be commercially successful, but for customers it's important that a selected product still exists after a few years. This is hard to predict today, because the market is still rapidly evolving. Let's see what the status is of this large group of products next year at Strata.

Posted February 20, 2014 1:46 AM
Lately, many SQL-on-Hadoop query engines have been released, including Drill, Hawq, IBM BigSQL, Impala, and Facebook Presto. This new generation of SQL-on-Hadoop query engines can be compared with classic SQL database servers in many ways. For example, one can study the richness of the SQL dialect, the transaction mechanism, the quality of the query optimizer, or the query performance. The latter in particular is done quite regularly.

In the long run, what's more interesting is the independence between, on the one hand, the SQL-on-Hadoop query engines and, on the other hand, the file systems and file storage formats. This is a difference overlooked by many.

Almost all classic SQL database servers come with their own file systems and file formats. That means that we can't create a database file using, for example, Oracle, and then later on access that same file using DB2. To be able to do that, the data has to be exported from the Oracle database and then imported into the DB2 database. In these systems, the SQL query engine plus the file system plus the file format form one indivisible unit. In a way, it's a proprietary stack of software for data storage, data manipulation, and data access.

One of the big advantages of most of the new SQL-on-Hadoop query engines is the independence between the three layers. Query engines, file systems and file formats are interchangeable. The consequence is that data can be inserted using, for example, Impala, and accessed afterwards using Drill. Or, a SQL query engine can seamlessly switch from the Apache HDFS file system to the MapR file system or to the Amazon S3 file system.

The big advantage is that this allows us to deploy different query engines, each with its own strengths and weaknesses, on the same data. There is no need to copy and duplicate the data. For example, for one group of users a SQL engine can be selected that is designed to support high-end, complex analytical queries, and for the other group an engine that's optimized for simpler interactive reporting.
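The idea of several engines working on one copy of the data can be illustrated with a small Python analogy. It is only a toy: a CSV file stands in for an open file format, and three hypothetical functions (`write_sales`, `lookup_engine`, `analytics_engine`) stand in for independent query engines. Real engines such as Impala and Drill are of course far more sophisticated.

```python
import csv
import os
import tempfile
from collections import defaultdict

def write_sales(path):
    """'Engine' A: writes the data once, in an open, shared format (CSV here)."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["region", "amount"])
        writer.writerows([["north", 10], ["south", 5], ["north", 7]])

def lookup_engine(path, region):
    """'Engine' B: a simple filter, the kind of query interactive reporting needs."""
    with open(path, newline="") as f:
        return [int(row["amount"]) for row in csv.DictReader(f)
                if row["region"] == region]

def analytics_engine(path):
    """'Engine' C: an aggregation over the same file, without export or import."""
    totals = defaultdict(int)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            totals[row["region"]] += int(row["amount"])
    return dict(totals)

with tempfile.TemporaryDirectory() as tmp:
    shared_file = os.path.join(tmp, "sales.csv")
    write_sales(shared_file)
    # Both readers work on the one and only copy of the data.
    north_amounts = lookup_engine(shared_file, "north")
    totals = analytics_engine(shared_file)

print(north_amounts)
print(totals)
```

What makes this work is that the file format is not owned by any of the three functions, which is the point of the layer independence described above.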

It's comparable to having one flat screen that can be used as TV, computer screen, and as projector screen, instead of having three separate screens.

This independence between the layers is a crucial difference between SQL-on-Hadoop query engines and classic SQL database servers. However, it's not getting the attention it deserves yet. The advantages of it are quite often overlooked. It's a bigger advantage than most people think, especially in the world of big data, where duplicating complete files can be very time-consuming and expensive.

Posted January 27, 2014 1:18 AM
A message to mapmakers: highways are not painted red, rivers don't have county lines running down the middle, and you don't see contour lines on a mountain. This is how William Kent starts his book "Data and Reality". The first edition was published in 1978; a long, long time ago. It was the time when people like Chris Date, Ted Codd and Ron Fagin were coming up with normalization and the accompanying normal forms. It was prime time for relational theory.

I bought the book in 1981. The edition I have is old. It uses a non-proportional font and you can clearly see that typesetting was done with Word, which at that time was the ultimate word processing software tool.

I still consider this book to be the best book on data modeling. Many other great books have been published, but this is still my favorite. One of the reasons I like it is that it addresses very fundamental data modeling questions, such as "Is an object still the same object when all its parts have been replaced?" and "What is really the difference between attributes and relationships?"

Maybe you have never heard of this book or William Kent, but you may possibly know his simple way to remember normalization: "[Every] non-key [attribute] must provide a fact about the key, the whole key, and nothing but the key."
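Kent's rule can be tried out on a small example. The Python sketch below is an illustration of my own and does not come from the book: the order lines table, with key (order_id, product_id), and the helper `determines` are hypothetical. The helper checks whether one set of columns functionally determines another column in the sample rows. Note that sample data can only refute a dependency, never prove it.

```python
def determines(rows, lhs, rhs):
    """Return True if the columns in lhs functionally determine rhs in these rows."""
    seen = {}
    for row in rows:
        key = tuple(row[column] for column in lhs)
        # The first value seen for a key is remembered; any other value refutes it.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True

# Hypothetical table with the composite key (order_id, product_id).
ORDER_LINES = [
    {"order_id": 1, "product_id": "A", "product_name": "Bolt", "quantity": 10},
    {"order_id": 1, "product_id": "B", "product_name": "Nut", "quantity": 4},
    {"order_id": 2, "product_id": "A", "product_name": "Bolt", "quantity": 3},
]

KEY = ["order_id", "product_id"]

# quantity is a fact about the whole key: part of the key is not enough.
quantity_needs_whole_key = (
    determines(ORDER_LINES, KEY, "quantity")
    and not determines(ORDER_LINES, ["product_id"], "quantity")
    and not determines(ORDER_LINES, ["order_id"], "quantity")
)

# product_name is a fact about only part of the key, so it breaks "the whole key".
name_depends_on_part_of_key = determines(ORDER_LINES, ["product_id"], "product_name")

print(quantity_needs_whole_key, name_depends_on_part_of_key)
```

Following the rule, product_name would move to a separate products table with product_id as its key.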

I have always recommended this book to the attendees of my data modeling sessions, even though I knew the book was hard to get. The good news is that a new edition has been published. It includes comments by Steve Hoberman and a note by Chris Date. If you haven't read this book yet, this is your chance. A must-read if you're into data modeling.

Posted January 8, 2014 1:21 AM