Blog: Rick van der Lans

Welcome to my blog where I will talk about a variety of topics related to data warehousing, business intelligence, application integration, and database technology. Currently my special interests include data virtualization, NoSQL technology, and service-oriented architectures. If there are any topics you'd like me to address, send them to me at rick@r20.nl.

About the author

Rick is an independent consultant, speaker and author, specializing in data warehousing, business intelligence, database technology and data virtualization. He is managing director and founder of R20/Consultancy. An internationally acclaimed speaker who has lectured worldwide for the last 25 years, he is the chairman of the successful annual European Enterprise Data and Business Intelligence Conference held in London. In the summer of 2012 he published his new book Data Virtualization for Business Intelligence Systems. He is also the author of one of the most successful books on SQL, the popular Introduction to SQL, which is available in English, Chinese, Dutch, Italian and German. He has written many white papers for various software vendors. Rick can be contacted by sending an email to rick@r20.nl.

Editor's Note: Rick's blog and more articles can be accessed through his BeyeNETWORK Expert Channel.

Lately, many SQL-on-Hadoop query engines have been released, including Drill, HAWQ, IBM Big SQL, Impala, and Facebook Presto. This new generation of SQL-on-Hadoop query engines can be compared with classic SQL database servers in many ways. For example, one can study the richness of the SQL dialect, the transaction mechanism, the quality of the query optimizer, or the query performance. The latter in particular is done quite regularly.

In the long run, what's more interesting is the independence between the SQL-on-Hadoop query engines on the one hand and the file systems and file storage formats on the other. This is a difference that many overlook.

Almost all classic SQL database servers come with their own file systems and file formats. That means that we can't develop a database file using, for example, Oracle, and then later on access that same file using DB2. To be able to do that, the data has to be exported from the Oracle database and then imported into the DB2 database. In these systems, the SQL query engine plus the file system plus the file format form one indivisible unit. In a way, it's a proprietary stack of software for data storage, data manipulation, and data access.

One of the big advantages of most of the new SQL-on-Hadoop query engines is the independence between the three layers. Query engines, file systems and file formats are interchangeable. The consequence is that data can be inserted using, for example, Impala, and accessed afterwards using Drill. Or, a SQL query engine can seamlessly switch from the Apache HDFS file system to the MapR file system or to the Amazon S3 file system.
This allows us to deploy different query engines, each with its own strengths and weaknesses, on the same data, without any need to copy or duplicate that data. For example, for one group of users a SQL engine can be selected that is designed to support high-end, complex analytical queries, while for another group an engine is selected that's optimized for simpler interactive reporting.
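To make this concrete, here is a minimal sketch in Python of the scenario described above: data written through Impala and read back through Drill from the very same Parquet directory. The host names, port numbers, the HDFS path /data/sales, and the use of the impyla and requests libraries are assumptions for illustration only; this is not a recipe for any specific Hadoop distribution.

    # Minimal sketch: one Parquet directory on shared storage, two query engines.
    # Assumed for illustration: an Impala daemon at impala-host:21050, a Drill
    # REST endpoint at drill-host:8047, and the HDFS directory /data/sales.
    from impala.dbapi import connect   # pip install impyla
    import requests                    # pip install requests

    # 1. Create and load a Parquet-backed table through Impala.
    impala = connect(host='impala-host', port=21050)
    cur = impala.cursor()
    cur.execute("""
        CREATE EXTERNAL TABLE IF NOT EXISTS sales (id INT, amount DOUBLE)
        STORED AS PARQUET
        LOCATION '/data/sales'
    """)
    cur.execute("INSERT INTO sales VALUES (1, 99.95), (2, 24.50)")

    # 2. Read the very same Parquet files through Drill, without copying anything;
    #    Drill addresses the directory directly via its dfs storage plugin.
    resp = requests.post(
        "http://drill-host:8047/query.json",
        json={"queryType": "SQL",
              "query": "SELECT id, amount FROM dfs.`/data/sales`"},
    )
    print(resp.json()["rows"])   # exact response layout may differ per Drill version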

It's comparable to having one flat screen that can be used as a TV, a computer screen, and a projector screen, instead of having three separate screens.

This independence between the layers is a crucial difference between SQL-on-Hadoop query engines and classic SQL database servers, yet it's not getting the attention it deserves and its advantages are quite often overlooked. It's a bigger advantage than most people think, especially in the world of big data, where duplicating complete files can be very time-consuming and expensive.


Posted January 27, 2014 1:18 AM
Permalink | 3 Comments |
A message to mapmakers: highways are not painted red, rivers don't have county lines running down the middle, and you don't see contour lines on a mountain. This is how William Kent starts his book "Data and Reality". The first edition was published in 1978; a long, long time ago. It was the time when people like Chris Date, Ted Codd and Ron Fagin were coming up with normalization and the accompanying normal forms. It was prime time for relational theory.

I bought the book in 1981. The edition I have is old. It uses a non-proportional font, and you can clearly see that the typesetting was done with Word, which at that time was the ultimate word processing tool.

I still consider this book to be the best book on data modeling. Many other great books have been published, but this is still my favorite. One of the reasons I like it is because it addresses very fundamental data modeling questions, such as "Is an object still the same object when all its parts have been replaced?" and "What is really the difference between attributes and relationships?"

Maybe you have never heard of this book or of William Kent, but you probably know his simple way to remember normalization: "[Every] non-key [attribute] must provide a fact about the key, the whole key, and nothing but the key."
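To make the mnemonic concrete, here is a small, hypothetical example (mine, not Kent's): in an order-line table with the composite key (order_id, product_id), a product_name column provides a fact about only part of the key, so it belongs in a table of its own.

    # Hypothetical order-line data with the composite key (order_id, product_id).
    order_lines = [
        # (order_id, product_id, quantity, product_name)
        (1001, "P1", 3, "Stapler"),
        (1001, "P2", 1, "Notebook"),
        (1002, "P1", 2, "Stapler"),   # "Stapler" repeats: product_name depends
    ]                                 # on product_id only, not on the whole key.

    # Normalized: product_name becomes a fact about its own key (product_id),
    # and the order-line table keeps only facts about the whole composite key.
    products = {"P1": "Stapler", "P2": "Notebook"}
    order_lines = [(1001, "P1", 3), (1001, "P2", 1), (1002, "P1", 2)]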

I have always recommended this book to the attendees of my data modeling sessions, even though I knew the book was hard to get. The good news is that a new edition has been published. It includes comments by Steve Hoberman and a note by Chris Date. If you haven't read this book yet, this is your chance. A must-read if you're into data modeling.


Posted January 8, 2014 1:21 AM
Permalink | No Comments |
Where a data scientist or analyst will find the answer to his question is not always obvious beforehand. For example, when he is looking for the dominant factor influencing sales of particular products, trying to find ways to increase the level of customer care, or trying to establish the risk of selling car insurance to young people, he may have no idea what the answer will be. He may not know which data sets are needed to come up with an answer, or which data items he has to study upfront.

Therefore, he needs tools that allow him to freely explore and investigate data. Incorporating more data sets in the analysis should be very easy, there should be no need to specify a goal beforehand, and it must be possible to analyze data in an unguided way.
Besides all the more standard features, such as displaying data as bar charts, in dashboards, and on geographical maps, the perfect tool for this type of work should support at least the following characteristics:

  • No advance preparations: There should be no need to predefine data structures or to set up the analysis work in advance. If data is available, it should be possible to load it without any preparation, even if it concerns a new type of data.
  • Unguided analysis: Analysts should be able to invoke the analysis technology without having to specify a goal in advance. The exploration technology should allow for analyzing data in an unguided style.
  • Self-service: Analysts should be able to use the analysis techniques without help from IT experts.
Connexica's CXAIR is such a tool. Its natural language interface, Venn-diagramming techniques, and visualization features allow users to freely query and analyze data. No cubes or star schemas have to be defined beforehand (which would limit the analysis capabilities).

Internally, CXAIR organizes all the data using an intelligent index; in fact, it's based on text-search technology. This makes it possible to combine and relate data without any form of restriction, which is exactly what analysts need.
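As an illustration of the general idea behind search-style indexing (a generic sketch, not CXAIR's proprietary implementation), an inverted index maps every field value to the set of records containing it, so any combination of values can be intersected on the fly without predefined cubes or join paths:

    from collections import defaultdict

    # A few example records; in practice these would come from loaded data sets.
    records = [
        {"id": 1, "city": "London", "product": "Stapler"},
        {"id": 2, "city": "Leeds",  "product": "Stapler"},
        {"id": 3, "city": "London", "product": "Notebook"},
    ]

    # Build the inverted index: (field, value) -> set of record ids.
    index = defaultdict(set)
    for rec in records:
        for field, value in rec.items():
            if field != "id":
                index[(field, value)].add(rec["id"])

    # An ad-hoc question nobody planned for: which records combine both values?
    print(index[("city", "London")] & index[("product", "Stapler")])   # {1}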

Unlike most analysis tools, CXAIR uses search technology to speed up data analysis. For calculations, a mixture of in-memory and on-disk caching is used to analyze massive amounts of data at search-engine speeds. Because CXAIR offers a thin-client web interface, all the loaded data resides on the server; no data is loaded onto the client machine. Numbers are cached on the server, but text is not. Because CXAIR doesn't cache all the data, available memory is not a restriction: the cache is used to improve performance, so a large internal memory is not a necessity, although it does help, particularly for ad-hoc calculations.

CXAIR is clearly a representative of a new generation of reporting and analysis tools that users can deploy to freely analyze data. It's a tool for self-service discovery and investigation of data, the kind of tool many data scientists have been waiting for, and it's worth checking out.


Posted December 10, 2013 6:52 AM
Permalink | 1 Comment |
Data replication tools have been available since the 1990s. They have been used primarily to increase the availability and scalability of IT systems and their data. Nowadays, they are also used to replicate data to data warehouses to support operational BI. Besides being able to efficiently and non-intrusively replicate data from source to target systems, a powerful feature has always been that they can operate in heterogeneous environments, in which the sources and targets are different products. But they have always limited themselves to SQL or SQL-like systems. An intriguing question is how difficult it will be for these data replicators to support the new generation of NoSQL systems. For example, will we be able to use them to replicate data stored in a NoSQL system to a staging area or data warehouse?

Many of the NoSQL systems have built-in data replication features--data is automatically stored multiple times. In fact, the developers can set how many replicas have to be created. However, the replication features of NoSQL systems are limited to a homogeneous environment. It's not possible to use these features when, for example, data has to be replicated from a NoSQL system to a classic SQL system.
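For example, in Apache Cassandra (one such NoSQL system) the number of replicas is simply part of the keyspace definition. The sketch below uses the Python cassandra-driver and an assumed contact point 'cassandra-host'; it only illustrates how a developer sets the replica count.

    from cassandra.cluster import Cluster   # pip install cassandra-driver

    # Connect to an (assumed) Cassandra node and create a keyspace in which
    # every piece of data is automatically stored three times.
    session = Cluster(["cassandra-host"]).connect()
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS sales
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
    """)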

Today, most data replication products can't replicate from or to NoSQL systems. However, if they can in the future, what will be important is that they handle the non-relational concepts of NoSQL systems efficiently. The keyword here is efficiently. Most existing data replication tools have been designed and optimized to copy data between SQL systems, so they have been optimized to efficiently process relatively short records with a fixed structure. However, NoSQL records are not always short, nor do they always have a fixed structure. NoSQL systems support a wide range of concepts:

  • Many NoSQL systems, including the key-value stores, the document stores, and the column-family stores, support extremely long records. These records can be orders of magnitude longer than what is common in SQL systems. Current data replicators have been optimized to replicate short records.
  • Almost all NoSQL systems support tables in which each record can have a different structure. This is new for data replication products. For example, what will that do to compression algorithms that assume that all records have the same structure?
  • Document stores and column-family stores support hierarchical structures. If that type of data has to be replicated into SQL systems, it has to be flattened somehow (a small sketch of such flattening follows this list). The challenge is to do that very fast. But can it be done fast enough? Data replicators are usually not strong at transformations, because transformations slow down the replication process too much.
  • Column-family stores support what the relational world used to call repeating groups. As with hierarchical structures, the question is how data replication tools can map them to relational structures efficiently.
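
As a minimal sketch of that flattening (illustrative only; no actual replication product is implied), a hypothetical customer document with a repeating group of orders has to be split into one parent row and a set of child rows:

    # A hypothetical document as a document store might deliver it.
    doc = {
        "customer_id": 42,
        "name": "Jansen",
        "orders": [                      # the repeating group / hierarchy
            {"order_id": "A1", "amount": 100.00},
            {"order_id": "A2", "amount": 250.00},
        ],
    }

    # Flattened into relational form: one customer row plus one order row per
    # element of the repeating group, linked through customer_id.
    customer_row = (doc["customer_id"], doc["name"])
    order_rows = [(doc["customer_id"], o["order_id"], o["amount"])
                  for o in doc["orders"]]
    # A replicator has to apply this transformation to every change it ships,
    # which is why doing it fast enough is the real challenge.
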
There is no question that we need data replication technology to replicate between NoSQL and SQL systems; the key question is whether it can do this efficiently. This is more than adding one extra source to the list of supported products: it requires a substantial redesign of the internals of these products. This is the challenge these vendors will be confronted with in the coming years. Hopefully, they will not claim to support NoSQL while, in fact, they only replicate data from NoSQL systems if that data has a relational form.

Note: For more on this topic, see the whitepaper Empowering Operational Business Intelligence with Data Replication.


Posted April 29, 2013 1:16 AM
Permalink | No Comments |
Are you interested in speaking at the Data Warehouse & Business Intelligence European Conference in London this coming November? If you are, please respond to the call for speakers.

Previous editions were very successful and attracted more than 200 delegates. Evaluations showed that the attendees were very pleased with the selected speakers, the topics, and the setup of the conference.

The 2013 edition covers all aspects of data warehousing and business intelligence, including trends, design guidelines, product overviews and comparisons, best practices, and new and evolving technologies. And, as in previous years, the conference is organized together with the highly successful European Data Management and Information Quality Conference.

With this year's call for presentations we are trying to attract proposals for sessions on both traditional and emerging aspects of data warehousing and business intelligence. Delegates have expressed a preference for case studies rather than theoretical or abstract topics, so we would particularly like practitioners in the field to respond to this call for papers. We also encourage new speakers to apply. Success stories - case studies where data warehousing and business intelligence have produced real bottom-line benefits - are very much appreciated.

Example topics for proposals are:

  • Agile BI
  • Big data analytics
  • BI in the cloud
  • Data modelling for data warehouses
  • NoSQL in a data warehouse environment
  • The logical data warehouse
  • Data virtualization and data federation
  • The maturity of analytical database servers
  • Star schema, snowflake and data vault models
  • Selling business intelligence to the business
  • The relationship between master data management and data warehousing
  • Guidelines for using ETL tools
  • Operational BI and real-time analytics
  • BAM (Business Activity Monitoring) and KPI (Key Performance Indicators)
  • BI scorecards
  • Customer analytics and insight
  • Text mining and text analytics
  • Open source BI
  • Corporate Performance Management
I am looking forward to your proposal and hope to see you in London this coming November.

Rick F. van der Lans
Chairman of the Data Warehouse & Business Intelligence European Conference 2013


Posted April 15, 2013 8:38 AM
Permalink | No Comments |