Blog: Rick van der Lans

Rick van der Lans

Welcome to my blog where I will talk about a variety of topics related to data warehousing, business intelligence, application integration, and database technology. Currently my special interests include data virtualization, NoSQL technology, and service-oriented architectures. If there are any topics you'd like me to address, send them to me at rick@r20.nl.

About the author

Rick is an independent consultant, speaker and author, specializing in data warehousing, business intelligence, database technology and data virtualization. He is managing director and founder of R20/Consultancy. An internationally acclaimed speaker who has lectured worldwide for the last 25 years, he is the chairman of the successful annual European Enterprise Data and Business Intelligence Conference held in London. In the summer of 2012 he published his new book Data Virtualization for Business Intelligence Systems. He is also the author of one of the most successful books on SQL, the popular Introduction to SQL, which is available in English, Chinese, Dutch, Italian and German. He has written many white papers for various software vendors. Rick can be contacted by sending an email to rick@r20.nl.

Editor's Note: Rick's blog and more articles can be accessed through his BeyeNETWORK Expert Channel.


More and more often, Apache's Hadoop is compared to relational databases. In most of those comparisons, Hadoop is presented as a non-relational database, as something totally different from classic database servers such as IBM's DB2, Microsoft SQL Server, and Oracle 11g. Comparing Hadoop this way makes no sense: Hadoop can be just as relational as those classic database servers.

 

Whether a system is relational does not depend on how the data is stored on disk, but entirely on how the data is perceived by the applications: on the language and/or API the applications use to insert, query, and manipulate the data.

 

In a nutshell, when the relational model was defined and introduced by Ted Codd, Chris Date, and others, the rule was that if a system could present all the data as tables and columns, and if that data could be accessed through a language supporting relational operators such as join, select, and project, that system was a relational system. Ted Codd called this data independence: an application should not be concerned with how the data is physically stored.

 

What this means is that if a system offers an interface in which data is presented as tables and which supports those relational operators, it offers a relational interface. For example, if a system supports a SQL interface on a data set, that system can be classified as relational. Note: I am aware that SQL does not adhere to all the rules needed to offer a truly relational interface, but for the sake of simplicity I will regard SQL as relational.
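
To make this a bit more concrete, here is a small sketch; the CUSTOMERS and ORDERS tables and their columns are made up for illustration. The statement combines the relational operators mentioned above: project (the select list), select or restrict (the WHERE clause), and join (the FROM clause):

    SELECT   C.NAME, SUM(O.AMOUNT) AS TOTAL_AMOUNT    -- project
    FROM     CUSTOMERS AS C
             INNER JOIN ORDERS AS O                    -- join
             ON O.CUSTOMER_ID = C.CUSTOMER_ID
    WHERE    O.ORDER_DATE >= DATE '2011-01-01'         -- select (restrict)
    GROUP BY C.NAME

Nothing in this statement tells the application how or where the rows of CUSTOMERS and ORDERS are physically stored, which is exactly the data independence Codd was after.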

 

We have to make a distinction between, on the one hand, the storage model and the storage engine and, on the other hand, the interface the applications use. Let's call the latter the interface model. Whether something qualifies as relational depends on that interface model and not on the storage model. Whether the data is stored as records, in a column-oriented fashion, in a key-value store, or, if at all possible, in a fish bowl, is irrelevant. The storage model does not determine whether a system is relational or not.

 

Hadoop's HDFS uses a very specific storage model and a unique storage engine, both of which differ from what the classic database servers have implemented. And of course, if we access the interface of HDFS directly, we don't see an interface that could be called relational, but a very technical, low-level interface instead. However, if we use HiveQL to access the data stored in Hadoop, or if we run a data virtualization server such as Composite Information Server or Informatica Data Services on top of Hadoop, in both cases the Hadoop data is accessed in a relational way, meaning it becomes a relational system to the applications.
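
As a sketch of what that looks like in practice, HiveQL allows us to lay a table definition over files that already reside in HDFS and then query them with a familiar SQL-like statement. The file location, columns, and file format below are hypothetical:

    CREATE EXTERNAL TABLE ORDERS
           (CUSTOMER_ID   INT,
            ORDER_DATE    STRING,
            ORDER_AMOUNT  DOUBLE)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/orders';              -- plain comma-separated files in HDFS

    SELECT   CUSTOMER_ID, SUM(ORDER_AMOUNT)
    FROM     ORDERS
    GROUP BY CUSTOMER_ID;

To the application issuing the SELECT statement, Hadoop behaves as a relational system; the fact that Hive translates the statement into MapReduce jobs on HDFS files remains invisible.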

 

This is not very different from accessing classic relational databases. If we access their data via the standard SQL interface, they are relational systems. However, if it were possible to develop an application that accesses the data by addressing their internal storage engines directly, the same data wouldn't look that relational anymore. By the way, those database servers don't always store the data as records either. For example, in Oracle data can be presented as tables while in fact it's stored as a multi-dimensional cube, and in Sybase IQ data is presented as tables but is stored in a column-oriented fashion using pointer structures.
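
To illustrate the point with a small, again hypothetical, example: the statement below can be sent unchanged to a database server that stores the rows as records, to Oracle storing them in a multi-dimensional structure, and to Sybase IQ storing them column by column. The statement itself gives no clue which storage model is used underneath:

    SELECT   REGION, AVG(ORDER_AMOUNT)    -- ORDERS is the hypothetical table from above
    FROM     ORDERS
    GROUP BY REGION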

 

To summarize, whether a system is relational does not depend on the storage model, but on the language and/or API used to access the data. The same data set can be presented as relational to one application and as non-relational to another. Hadoop offers a special storage model, but that doesn't mean the data cannot be presented in a relational way. In fact, the same applies to most of the new so-called NoSQL database servers.

 

To come back to the comparisons: it would make sense to compare the storage model of Hadoop with the storage models of other database servers, and it would make sense to compare the interfaces of Hadoop with those of other database servers. But comparing the Hadoop storage model with the interface model of classic database servers is like comparing apples and pears.


Posted September 12, 2011 9:26 AM