Blog: Rick van der Lans

Rick van der Lans

Welcome to my blog where I will talk about a variety of topics related to data warehousing, business intelligence, application integration, and database technology. Currently my special interests include data virtualization, NoSQL technology, and service-oriented architectures. If there are any topics you'd like me to address, send them to me at rick@r20.nl.

About the author

Rick is an independent consultant, speaker and author, specializing in data warehousing, business intelligence, database technology and data virtualization. He is managing director and founder of R20/Consultancy. An internationally acclaimed speaker who has lectured worldwide for the last 25 years, he is the chairman of the successful European Enterprise Data and Business Intelligence Conference held annually in London. In the summer of 2012 he published his new book Data Virtualization for Business Intelligence Systems. He is also the author of one of the most successful books on SQL, the popular Introduction to SQL, which is available in English, Chinese, Dutch, Italian and German. He has written many white papers for various software vendors. Rick can be contacted by sending an email to rick@r20.nl.

Editor's Note: Rick's blog and more articles can be accessed through his BeyeNETWORK Expert Channel.

First, we had the data warehouse and the data mart, now we also have the data lake, the data reservoir, and the data hub. These new architectures have all been introduced recently and have a direct relationship with the ever more popular Hadoop technology. But don't the fundamentals of Hadoop clash with those of these architectures? Is the technology not inconsistent with the architectures?

Forbes magazine defined the data lake by comparing it to a data warehouse: "The difference between a data lake and a data warehouse is that the data in a data warehouse is pre-categorized at the point of entry, which can dictate how it's going to be analyzed." Data hub is a term used by various big data vendors. In a recent tweet, Clarke Patterson summarized it as follows: data lake = storage + integration; and data hub = data lake + exploration + discovery + analytics. A data reservoir is a well-managed data lake.

All three architectures have in common that data from many sources is copied to one centralized data storage environment. After it has been stored in the data lake (or hub or reservoir), it's available for all business purposes such as integration and analysis. The preferred technology for developing all three is Hadoop. Hadoop is selected for two reasons: first, its low price/performance ratio makes it attractive for developing large-scale environments, and second, it allows raw data to be stored in its original form and not in a form that already limits how we can use the data later on.

In principle, there is nothing wrong with the idea of making all data available in a way that leaves open any form of use, but is moving all the data to one physical data storage environment implemented with Hadoop the smartest alternative?

We have to remember that Hadoop's primary strength, and the source of its processing scalability, is that it pushes processing to the data, and not vice versa as most other platforms do. Because it doesn't push the data to the processing, Hadoop is able to offer high levels of storage and processing scalability.
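
The difference between the two directions can be illustrated with a minimal Python sketch. This is not Hadoop code; the node names, the records, and the function names are all invented for illustration. In the first function every record is shipped to one place before it is aggregated; in the second, each node aggregates its own records and only the small partial results are moved.

```python
# Hypothetical data: three storage nodes, each holding its own sales records.
NODES = {
    "node_a": [("books", 10), ("music", 5), ("books", 7)],
    "node_b": [("music", 3), ("games", 8)],
    "node_c": [("books", 1), ("games", 2), ("games", 4)],
}


def aggregate(records):
    """Sum the amounts per category for one list of (category, amount) pairs."""
    totals = {}
    for category, amount in records:
        totals[category] = totals.get(category, 0) + amount
    return totals


def data_to_processing(nodes):
    """Ship every record to a central place, then aggregate there."""
    shipped = [record for records in nodes.values() for record in records]
    return aggregate(shipped), len(shipped)


def processing_to_data(nodes):
    """Aggregate on each node first; only the partial totals are moved."""
    partials = [aggregate(records) for records in nodes.values()]
    moved = sum(len(partial) for partial in partials)
    merged = [item for partial in partials for item in partial.items()]
    return aggregate(merged), moved


if __name__ == "__main__":
    print(data_to_processing(NODES))
    print(processing_to_data(NODES))
```

Both functions return the same totals, but the second moves fewer items across the (imaginary) network. With this toy data the saving is small; the gap widens as the number of records per node grows while the number of categories stays the same.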

Now, isn't this inconsistent with the data lake, data hub, and data reservoir architectures? On one hand we have, for example, the data lake in which all the data is pushed to a centralized data storage environment for processing, and on the other hand, we deploy Hadoop technology, which shines when processing is pushed to the data.

Shouldn't we come up with architectures that are more in line with the fundamental design concept of Hadoop itself, namely architectures where processing (integration and analysis) is pushed to the systems where the data originates (where the data is entered and collected)?

What is needed are systems that make applications and users think that all data is centralized and that all can be freely analyzed. This also avoids massive and continuous data movement exercises and avoids duplicate storage of data. In other words, we need virtual data lakes, virtual data hubs and virtual data reservoirs. Such architectures would be much more in line with Hadoop, which is based on pushing processing to the data. Maybe we should open this discussion first before we recommend organizations to invest heavily in massive data lakes in which they may drown.


Posted May 19, 2014 12:48 AM
We all know and understand dashboards. Dashboards consist of visual components in the form of gauges, graphs, pie charts, bar charts, heat maps, and so on, allowing business users to get a quick overview of key performance indicators (KPI).

I thought that we had seen it all with respect to dashboards and that nothing new could be invented. Wrong! Newcomer VisualCue has surprised me completely. This tool supports a new way of presenting KPIs with dashboards, and a new way of working with them.

Besides the more traditional visual components, the key visual component in VisualCue is a tile. In a tile various icons can be used to show multiple KPIs of a process or object. Each icon represents a KPI. A tile can represent, for example, a call center agent, a truck driver, a flight, a hotel, or a project. Heavy use is made of colors to indicate the state of the KPI, and thus the object. The icons can be designed by the customer or they can be selected from a long predefined list.

By showing many tiles next to each other, business users can see the states of numerous objects or processes at a glance. In other words, the tiles make it possible for business users to see and process a lot of detailed data at once. There is no need to aggregate data, which always leads to hiding data, and hiding data can lead to missing opportunities or possible problems. So, with tiles users will see the wood for the trees.

The tiles are dynamic. They can be classified and grouped based on characteristics; for example, call center operators can be grouped by the business goals they have reached. Business users can also drill down on the tiles, showing even more details. And that's where this product really shines.

Again, I thought I had seen everything with respect to dashboards, but clearly I was wrong. The product is still very young, so it still has to prove itself in large, real-life projects. But I recommend checking out VisualCue anyway.




Posted May 13, 2014 2:53 AM
"Prediction is very difficult, especially if it's about the future." These were the famous words of Nils Bohr, Nobel laureate in Physics. I discovered how true he was when organizing a large pile of  old IT magazines in my office. I found a magazine published by Fawcette Technical Publications. It was a special issue to celebrate their 10th anniversary, and was published in the winter of 2000/2001. The issue was called "The Future of Software" and contained more than 30 short articles written by a long list of experts.

Some of the authors were dead right with their predictions:

  • Avi Silberschatz: Accessing continuously growing databases with zero latency requires new solutions, such as main-memory database systems.
  • Richard Schwartz: The future is for the 10-second application; the challenge is to develop applications requiring an absolute minimal level of keystrokes ...
  • Kent Beck: In the next 10 years ... we jump to a world where we see software development as a conversation between business and technology.

And some were less lucky (I will leave it to the readers to determine which ones were correct):

  • Eric Bonabeau: Swarm intelligence will become a new model for computing.
  • Mitchell Kertzman: The future of software is to be platform-independent, not just OS-independent, but device-independent, network-transport independent ...
  • Thomas Vayda: By 2003, 70% of all new applications will be built primarily from "building blocks" ...

Interestingly enough, the big changes in the IT industry of the last few years were not mentioned at all, such as:

  • Big data
  • Smartphones
  • Social media networks

This shows how hard it really is to predict the future of the IT industry. Maybe it's better to listen to Sir Winston Churchill: "I always avoid prophesying beforehand because it is much better to prophesy after the event has already taken place."


Posted March 2, 2014 11:29 AM
If one thing became clear to me at the Strata Conference this month, it was that the popularity of Hadoop is unmistakable and that SQL-on-Hadoop follows closely in its footsteps. A SQL-on-Hadoop engine makes it possible to access big data, stored in Hadoop HDFS or HBase, using the language so familiar to many developers, namely SQL. SQL-on-Hadoop also makes it easier for popular reporting and analytical tools to access big data in Hadoop.

Tools that have been offering access to non-SQL data sources using SQL for a long time are the data virtualization servers. Most of them allow SQL access to data stored in spreadsheets, XML documents, sequential files, pre-relational database servers, data hidden behind APIs such as SOAP and REST, and data stored in applications such as SAP and Salesforce.com.

Most of the current SQL-on-Hadoop engines offer only SQL query access to one or two data sources: HDFS and HBase. Sounds easy, but it's not. The technical problem they have to solve is how to turn all the non-relational data stored in Hadoop, such as variable data, self-describing data, and schema-less data, into flat relational structures.
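
To give a feel for that problem, here is a minimal Python sketch that turns a few self-describing JSON records into one flat table. It is an illustration only, not how any of the engines mentioned here works; the sample records, the function names, and the choice to keep lists as JSON text are all assumptions made for this example.

```python
import json

# Hypothetical schema-less input: each line carries its own structure.
RAW = [
    '{"id": 1, "name": "Ann", "address": {"city": "Utrecht", "zip": "3511"}}',
    '{"id": 2, "name": "Bob", "tags": ["a", "b"]}',
]


def flatten(obj, prefix=""):
    """Turn a nested dict into a flat dict with underscore-joined column names."""
    row = {}
    for key, value in obj.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested objects become extra columns: address.city -> address_city.
            row.update(flatten(value, column + "_"))
        elif isinstance(value, list):
            # Assumption: keep repeating values as JSON text in one column.
            row[column] = json.dumps(value)
        else:
            row[column] = value
    return row


def to_table(lines):
    """Return (columns, rows); fields missing from a record become None."""
    flat = [flatten(json.loads(line)) for line in lines]
    columns = sorted({column for row in flat for column in row})
    rows = [tuple(row.get(column) for column in columns) for row in flat]
    return columns, rows


if __name__ == "__main__":
    columns, rows = to_table(RAW)
    print(columns)
    for row in rows:
        print(row)
```

Even this toy version has to make design decisions that a real engine cannot avoid: what to do with fields that appear in only some records, and how to represent repeating values in a structure that expects one value per column.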

However, the question is whether offering query capabilities on Hadoop is sufficient, because the bar is being raised for SQL-on-Hadoop engines. Some, such as SpliceMachine, offer transactional support on Hadoop in addition to the queries. Others, such as Cirro and ScleraDB, support data federation: data stored in SQL databases can be joined with Hadoop data. So, maybe offering SQL query capabilities on Hadoop will not be enough anymore in the near future.
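
As a rough illustration of what data federation means, the following Python sketch joins rows from a SQL database with records from a non-SQL source. An in-memory SQLite database stands in for the SQL side and a plain list of dictionaries stands in for data stored in Hadoop; the table, the sample data, and the function names are invented for this example and do not reflect how Cirro or ScleraDB work internally.

```python
import sqlite3


def make_sql_source():
    """Build a small in-memory SQL database that plays the role of the SQL side."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(1, "Ann"), (2, "Bob"), (3, "Cy")],
    )
    return conn


# Hypothetical click log standing in for data stored in Hadoop.
CLICK_LOG = [
    {"customer_id": 1, "page": "/home"},
    {"customer_id": 1, "page": "/cart"},
    {"customer_id": 2, "page": "/home"},
]


def clicks_per_customer(conn, click_log):
    """Join both sources on customer id and count the clicks per customer."""
    # Aggregate the non-SQL side first, so only small counts take part in the join.
    counts = {}
    for event in click_log:
        cid = event["customer_id"]
        counts[cid] = counts.get(cid, 0) + 1
    rows = conn.execute("SELECT id, name FROM customers ORDER BY id")
    # Inner join: customers without clicks are left out.
    return [(name, counts[cid]) for cid, name in rows if cid in counts]


if __name__ == "__main__":
    print(clicks_per_customer(make_sql_source(), CLICK_LOG))
```

The user of such a function sees one combined result and does not need to know that the two inputs live in different systems, which is the essence of federation.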

Data virtualization servers have started to offer access to Hadoop as well, and with that they have entered the market of SQL-on-Hadoop engines. As they do, they will raise the bar for SQL-on-Hadoop engines even more. Current data virtualization servers are not simply runtime engines that offer SQL access to various data sources. Most of them also offer data federation capabilities for many non-SQL data sources, a high-level design and modeling environment with lineage and impact analysis features, caching capabilities to minimize access to the data source, distributed join optimization techniques, and data security features.

In the near future, SQL-on-Hadoop engines are expected to be extended with these typical data virtualization features. And data virtualization servers will have to enrich themselves with full-blown support for Hadoop. But whatever happens, the two markets will slowly converge into one. Products will merge together and others will be extended. This is definitely a market to keep an eye on in the coming years.



Posted February 24, 2014 3:39 AM
This year at the Strata Conference in Santa Clara, CA it was very clear: The Battle of the SQL-on-Hadoop engines is underway. Many existing and new vendors presented their solutions at the exhibition and many sessions were dedicated to this topic.

SQL-on-Hadoop is as popular today as NoSQL was a year ago. Here are some of the many implementations: Apache Hive, CitusDB, Cloudera Impala, Concurrent Lingual, Hadapt, InfiniDB, JethroData, MammothDB, MapR Drill, MemSQL, Pivotal HAWQ, Progress DataDirect, ScleraDB, Simba, and SpliceMachine.

Besides these implementations, we should also include all the data virtualization products that are designed to access all kinds of data sources including Hadoop and to integrate data from different data sources. Examples are Cirro, Cisco/Composite, Denodo, Informatica IDS, RedHat JBoss Data Virtualization, and Stonebond.

And, of course, we have a few SQL database servers that support polyglot persistence. This means that they can store their data in their own native SQL database or in Hadoop. Examples are EMC/Greenplum UAP, HP Vertica (on MapR), Microsoft PolyBase, Actian ParAccel, and Teradata Aster database (SQL-H).

Most of these implementations are currently restricted to querying the data stored in Hadoop, but some, such as SpliceMachine, support transactions on Hadoop. Most of them don't work with indexes, although JethroData does.

This attention for SQL-on-Hadoop makes a lot of sense. Making all the big data stored in HDFS available through a SQL interface makes it accessible to numerous reporting and analytical tools. It makes big data available to the masses. It's no longer only for the happy few who are good at programming Java.

If you're interested in SQL-on-Hadoop, you have to study at least two technical aspects. First, how efficient are these engines when executing joins? Joining multiple big tables in particular is a technological challenge. Second, running one query fast is relatively easy, but how well do these engines manage their workload if multiple queries with different characteristics have to be executed concurrently? Can one resource-hungry query consume all the available resources, making all the other queries wait? So, don't be influenced too much by single-user benchmarks.

It's easy to predict that we will see many more of these SQL-on-Hadoop implementations entering this market. That the existing products will improve and become faster is evident. The big question is which of them will survive this battle? That not all of them will be commercially successful is evident, but for customers it's important that a selected product still exists after a few years. This is hard to predict today, because the market is still rapidly evolving. Let's see what the status is of this large group of products next year at Strata.


Posted February 20, 2014 1:46 AM