Blog: Rick van der Lans Subscribe to this blog's RSS feed!

Rick van der Lans

Welcome to my blog where I will talk about a variety of topics related to data warehousing, business intelligence, application integration, and database technology. Currently my special interests include data virtualization, NoSQL technology, and service-oriented architectures. If there are any topics you'd like me to address, send them to me at rick@r20.nl.

About the author >

Rick is an independent consultant, speaker, and author, specializing in data warehousing, business intelligence, application integration and database technology. He is managing director and founder of R20/Consultancy. He is an internationally acclaimed speaker who has lectured worldwide for the last 25 years. He is the chairman of the successful annual European Business Intelligence and Data Warehouse Conference held in London. In the summer of 2012 he published his new book Data Virtualization for Business Intelligence Systems. He is also the author of one of the most successful books on SQL, the popular Introduction to SQL, which is available in English, Chinese, Dutch, Italian, and German . He has written many white papers for various software vendors. Rick can be contacted by sending an email to rick@r20.nl.

Editor's Note: Rick's blog and more articles can be accessed through his BeyeNETWORK Expert Channel.

Data replication tools have been available since the 1990s.  They have been used primarily to increase the availability and scalability of IT systems and their data. Nowadays, they are also used to replicate data to data warehouses for supporting operational BI. Besides being able to efficiently and non-intrusively replicate data from source to target systems, a powerful feature has always been that they can operate in heterogeneous environments, in which the source and targets are different products. But they have always limited themselves to SQL or SQL-like systems. An intriguing question is whether it is difficult for these data replicators to support the new generation of NoSQL systems? For example, will we be able to use them to replicate data stored in a NoSQL system to a staging area or data warehouse?

Many of the NoSQL systems have built-in data replication features--data is automatically stored multiple times. In fact, the developers can set how many replicas have to be created. However, the replication features of NoSQL systems are limited to a homogenous environment. It's not possible to use these features when, for example, data has to be replicated from a NoSQL system to a classic SQL system.

Today, most data replication products can't replicate from or to NoSQL systems. However, if they can in the future, what will be important is that they handle the non-relational concepts of NoSQL systems efficiently. The keyword here is efficiently. Most existing data replication tools have been designed and optimized to copy data between SQL systems. So, they have been optimized to efficiently process relatively short records with a fixed structure. However, NoSQL records are not always short and fixed with respect to structure. NoSQL systems support a wide range of concepts:

  • Many NoSQL systems, including the key-value stores, the document stores, and the column-family stores support extremely long records. These records can be magnitudes longer than what is common in SQL systems. Current data replicators have been optimized to replicate short records.
  • Almost all NoSQL systems support tables in which each record can have a different structure. This is new for data replication products. For example, what will that do to compression algorithms that assume that all records have the same structure?
  • Document stores and column family stores support hierarchical structures. If that type of data has to be replicated into SQL systems, it has to be flattened somehow. The challenge is to do that very fast. But can it be done fast enough? Data replicators are usually not strong at transformations, because it slows down the replication process too much.
  • Column-family stores support what the relational world used to call repeating groups. The same as for hierarchical structures, how can they be mapped to relational structures by the data replication tools efficiently.
There is no question about whether we need data replication technology to replicate between NoSQL and SQL systems. But the key question is whether it can do this efficiently. This is more than adding one extra source to their list of supported products. It requires a substantial redesign of the internals of these products. This is the challenge these vendors are confronted with in the coming years. Hopefully, they will not claim to support NoSQL, while in fact they only replicate data from NoSQL systems if that data has a relational form.

Note: For more on this topic, see the whitepaper Empowering Operational Business Intelligence with Data Replication.


Posted April 29, 2013 1:16 AM
Permalink | No Comments |
Are you interested in speaking at the Data Warehouse & Business Intelligence European Conference in London coming November? If you are, please fill in the call for speakers.

Previous editions were very successful and attracted more than 200 delegates. Evaluations showed that the attendees were very pleased with the selected speakers, the topics, and setup of the conference.

The 2013 edition is aimed at all aspects of data warehousing and business intelligence, including: trends, design guidelines, product overviews and comparisons, best practices, and new evolving technologies. And like the previous years, the conference is organized together with the highly successful European Data Management and Information Quality Conference.

With this year's call for presentations we are trying to attract proposals for sessions on traditional and future data warehousing and business intelligence aspects. Delegates have expressed a preference for the use of case studies rather than theoretical or abstract topics. We would particularly like practitioners in the field to respond to this call for papers. We encourage new speakers to apply. Success stories - case studies where data warehousing and business intelligence have produced real bottom-line benefits are very much appreciated.

Example topics for proposals are:

  • Agile BI
  • Big data analytics
  • BI in the cloud
  • Data modelling for data warehouses
  • NoSQL in a data warehouse environment
  • The logical data warehouse
  • Data virtualization and data federation
  • The maturity of analytical database servers
  • Star schema, snowflake and data vault models
  • Selling business intelligence to the business
  • The relationship between master data management and data warehousing
  • Guidelines for using ETL tools
  • Operational BI and real-time analytics
  • BAM (Business Activity Monitoring) and KPI (Key Performance Indicators)
  • BI scorecards
  • Customer analytics and insight
  • Text mining and text analytics
  • Open source BI
  • Corporate Performance Management
I am looking forward to your proposal, and hope to see you in London coming November.

Rick F. van der Lans
Chairman of the Data Warehouse & Business Intelligence European Conference 2013


Posted April 15, 2013 8:38 AM
Permalink | No Comments |

About two months ago, Pervasive Software asked me to write a whitepaper describing how well their popular PSQL database server supports Codd's twelve rules for relational databases.

For those not familiar with these rules, in 1985, E.F. (Ted) Codd, the founder of the relational model, defined a set of twelve rules for determining how well a database product supports the relational model. These rules make it possible to answer the question whether a particular product is a relational database server. They were urgently needed, because many vendors were labeling their products as relational, while they were not. So, the term relational became somewhat polluted and Codd wanted to fix and prevent this.

The study was a real trip down memory lane. It was a pleasure to reread all those articles and books written by Codd himself and those by Chris Date on, for example, updatable views. The work they did then, was brilliant. So much of what they wrote, is after so many years, still very true.

After studying Pervasive PSQL in detail, my verdict is that it scores a 10 (on a scale of 0 to 12). Nine rules are fully supported, two partially, and two not. Therefore, the overall conclusion is that PSQL is 83% relationally complete. This is an excellent score and puts PSQL in the list of most relational products.

Is it possible to be 100% relational? The answer is yes. Such products can be developed. In fact, there is one open source product that supports most of the rules: Alphora's DataPhor. However, the product is not (yet) a commercial success. In the same year when Codd introduced the twelve rules, he also wrote "No existing DBMS product that I know of can honestly be claimed to be fully relational, at this time." It looks as if this statement still holds for all the SQL products and probably for most database servers.

Note: Now that Pervasive and Actian have merged, maybe I should write a comparable paper for their Ingres and Vectorwise database server, and see which one is the most relational product.


Posted April 15, 2013 7:40 AM
Permalink | No Comments |
Operational Business Intelligence (OBI) is not a new concept. Although a universally accepted definition of the term doesn't exist, most BI specialists know what it means. It's about presenting operational data to the users. Instead of viewing and analyzing data that is one day or one week old, the data is 100% or close to 100% up to date.

For more and more users OBI is crucial. For example, consider operational management and external parties, such as customers, suppliers, and agents. If we give them access to data to support their decision making processes, in many cases, only operational data is relevant.

But how do we develop BI systems that show operational data? In PowerPoint we can draw an architecture in which BI reports directly access operational databases. And on that PowerPoint slide all seems to work fine. Not in real life, however. Running a BI workload on an operational database can lead to interference, performance degradation, performance instability, and so on. In other words, the operational environment is not going to enjoy this.

This is where data replication can come to the rescue. With data replication we can create and keep a replica of an operational database up to date without interfering with the operational processing. When new data is inserted, updated, or deleted in the original operational database, the replica is updated accordingly and almost instantaneously. This replicated database can then be used for operational reporting and analytics.

Data replication as a technology has been around for a long time, but so far it has been used primarily to increase the availability and/or to distribute the workload of operational systems. My expectation is that data replication will be needed for implementing many new OBI systems. For these products to be ready for BI, besides supporting classic data replication features, such as minimal interference, high throughput, and high availability, they should also support the following three features that are important for BI:

  • Easy to use and easy to maintain: Until now, data replication has been used predominantly in IT departments, and not so much in BI departments or BI Competence Centers. So within these BI groups a minimum of expertise exists with data replication and knowledge on how to embed that technology in BI architectures. Because of this unfamiliarity, it's important that these products are easy to install, easy to manage, and that replication specifications can be changed quickly and easily. A Spartan interface is not appreciated.
  • Heterogeneous data replication: In many organizations the database servers used in these operational environments are different from the ones deployed in their BI environments. Therefore, data replication tools should be able to move data between database servers of different brands.
  • Fast loading into analytical database servers: More and more analytical database servers, such as data warehouse appliances and in-memory database servers, are used to develop data warehouses and/or data marts. These database servers are amazingly fast in running queries. What we don't want is that data is loaded in these products using simple SQL INSERT statements. It will work, but it will be slow. Almost all of these products have specialized interfaces or utilities for fast loading of data. It's vital that data replication products exploit these interfaces or utilities.
To summarize, because of OBI, the need for data replication will increase. It's important that organizations, when they evaluate this technology, study the three features above. For more information on this topic I refer to this whitepaper and this webinar.


Posted March 6, 2013 8:29 PM
Permalink | No Comments |
Quite recently, during my trip to the Silicon Valley in the San Francisco Bay Area, I visited some of the well-known NoSQL vendors. What became obvious after a couple of meetings is that all of them have added or are adding SQL interfaces to their products (or some form of SQL). For example, Cloudera will release Impala, MapR is working on Drill, and DataStax has CQL. These are all SQL interfaces. But the list goes on, data virtualization vendors, such as Composite, Denodo, and Informatic, support access to NoSQL products, Hadapt offers SQL on top of Hadoop, Simba and Quest have released SQL support for various NoSQL products, and MarkLogic, a NoSQL vendor that has developed a search-based transactional database server, will release a SQL interface. In other words, the SQL-fication of NoSQL has started and continues in a rapid pace.

Adding SQL is a wise decision, because through SQL, (big) data stored in these systems, becomes available to a much larger audience and therefore becomes more valuable to the business. It makes it possible to use a much broader set of products to query and analyze that data. Evidently, not all these SQL implementations are perfect today, but I don't doubt that they will improve over time.

Considering this SQL-fication that's going on, how much longer can we state that the term NoSQL stands for NO SQL? Maybe in a few years we will say that NoSQL stands for Not Originally SQL.

In a way, this transformation reminds me of the history of Ingres. This database server started out as a NoSQL product as well. In the beginning, Ingres supported a database language called Quel (a relational language, but not SQL). Eventually, the market forced them to convert to SQL. Not Originally SQL certainly applies to them.

Anyway, the SQL-fication of NoSQL products and big data has started and continues. To me, this is a great development, because more and more organizations understand what a major asset it is. Therefore, data, any data, big or small, should be stored in systems that can be accessed by as many tools, applications, and users as possible, and that's what SQL offers. Such a valuable asset should not be hidden and buried deep in systems that can only be accessed by experts and technical wizards


Posted February 8, 2013 8:53 AM
Permalink | 2 Comments |
PREV 1 2 3 4 5 6