Blog: Rick van der Lans

Rick van der Lans

Welcome to my blog where I will talk about a variety of topics related to data warehousing, business intelligence, application integration, and database technology. Currently my special interests include data virtualization, NoSQL technology, and service-oriented architectures. If there are any topics you'd like me to address, send them to me at rick@r20.nl.

About the author

Rick is an independent consultant, speaker and author, specializing in data warehousing, business intelligence, database technology and data virtualization. He is managing director and founder of R20/Consultancy. An internationally acclaimed speaker who has lectured worldwide for the last 25 years, he is the chairman of the successful European Enterprise Data and Business Intelligence Conference, held annually in London. In the summer of 2012 he published his new book Data Virtualization for Business Intelligence Systems. He is also the author of one of the most successful books on SQL, the popular Introduction to SQL, which is available in English, Chinese, Dutch, Italian and German. He has written many white papers for various software vendors. Rick can be contacted by sending an email to rick@r20.nl.

Editor's Note: Rick's blog and more articles can be accessed through his BeyeNETWORK Expert Channel.

November 2012 Archives

In a series of blogs I am answering some of the questions a large US-based health care organization had on data virtualization. I decided to share some of their questions with you, because they represent issues that many organizations struggle with.

For those not familiar with the concept or the name: a Rube Goldberg machine, contraption, invention, device, or apparatus is a deliberately over-engineered or overdone machine that performs a very simple task in a very complex fashion, usually including a chain reaction--this is the definition used by Wikipedia. Examples of very simple tasks are pouring beer into a glass, opening a door, or switching on a TV. Rube Goldberg was an American cartoonist, best known for drawing such weird machines. Here you can find a photo showing an example of a Rube Goldberg machine. On YouTube you can find numerous films showing such machines at work. One you have to see is the one developed by a young kid called Audri.

Why a discussion on these weird and often useless machines? Quite recently, I received an email from my customer with the following remark: "I have been thinking about the complexities of physical integration of our systems [by physical integration he means using classic ETL and duplicating data in several databases]. I wish I had had Audri's YouTube video when I was trying to urge my team to consider data virtualization. After seeing that little boy, it feels as if developing and testing a system based on physical integration is like trying to develop and test a Rube Goldberg machine."

He continued: "Knowing what I know now [after studying data virtualization more seriously], if data virtualization is comparable to using a remote control to turn a TV on and off, then physical integration is comparable to developing and using a Rube Goldberg machine to turn a TV on and off."

Evidently, this is an exaggeration, because there are still various situations in which you have to, or want to, deploy a form of physical integration. But there is some truth in it. Software and hardware are so much more powerful today than they were ten years ago. In fact, there is so much more "power" available that if organizations had to design their current BI systems from scratch, they would probably come up with much simpler architectures, ones in which agility would be a fundamental design factor. Data virtualization is one of the technologies that would clearly help to develop such agile BI systems.

So, years ago, when we designed the architectures of our BI systems, they were not considered Rube Goldberg machines. They were necessities; there was no other choice. But today there is, and if we look at these architectures now, they do resemble Rube Goldberg machines. They are like machines in which the data values roll down a spiral, are thrown from one database to another, are changed occasionally, sporadically fall off some track, and sometimes even float a few inches before they arrive in a report.
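To make the contrast a little more concrete, here is a deliberately simplified sketch. All table, column, and view names are made up, and both "source systems" live in one SQLite database purely for illustration; a real data virtualization server federates across genuinely separate sources. The physical route copies the integrated data into a new table that has to be loaded and kept in sync from then on; the virtual route captures the same integration logic in a view that reports can query directly.

```python
import sqlite3

# Two "source systems", simulated here as tables in one SQLite database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE crm_customers (id INTEGER, name TEXT)")
con.execute("CREATE TABLE billing_invoices (customer_id INTEGER, amount REAL)")
con.execute("INSERT INTO crm_customers VALUES (1, 'Acme'), (2, 'Globex')")
con.execute("INSERT INTO billing_invoices VALUES (1, 100.0), (1, 250.0), (2, 75.0)")

# Physical integration (the Rube Goldberg route): copy the integrated data
# into a derived table via an ETL step; this copy must be refreshed,
# tested, and kept in sync forever after.
con.execute("""
    CREATE TABLE dw_customer_revenue AS
    SELECT c.name, SUM(i.amount) AS revenue
    FROM crm_customers c JOIN billing_invoices i ON i.customer_id = c.id
    GROUP BY c.name
""")

# Data virtualization (the remote control route): the same integration
# logic lives in a view; nothing is duplicated, and reports always see
# the current source data.
con.execute("""
    CREATE VIEW v_customer_revenue AS
    SELECT c.name, SUM(i.amount) AS revenue
    FROM crm_customers c JOIN billing_invoices i ON i.customer_id = c.id
    GROUP BY c.name
""")

print(con.execute("SELECT * FROM v_customer_revenue").fetchall())
```

Note how both routes contain exactly the same integration logic; the difference is that the physical route turns that logic into yet another data store that must be scheduled, monitored, and tested.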

From now on, I will use Audri's film to explain the differences between developing BI systems with and without data virtualization.

Note: If you have questions related to data virtualization, send them in. I am more than happy to answer them.
 


Posted November 19, 2012 11:15 AM
Permalink | 1 Comment
In this series of blogs I'm answering questions on data virtualization coming from a particular organization. Even though the question "Is data virtualization immature technology?" did not come from them, I decided to include it in this series, because it is asked so frequently.

Probably because the term data virtualization is relatively young, some think it's also young technology and thus immature and only usable in small environments. This is a misunderstanding. Therefore, I decided to give a sense of the long and rich history of data virtualization, making use of extracts from my book "Data Virtualization for Business Intelligence Systems."

Fundamental to data virtualization are the concepts of abstraction and encapsulation. These concepts have their origin in the early 1970s. Exactly forty years ago, in 1972, David L. Parnas wrote a groundbreaking article, "On the Criteria to be Used in Decomposing Systems into Modules." In this, to me, legendary article, Parnas explains how important it is that applications are developed in such a way that they become independent of the structure of the stored data. The big advantage of this concept is that if one changes, the other doesn't necessarily have to change. In addition, by hiding technical details, applications become easier to maintain, or, to use a more modern term, more agile. Parnas called this information hiding and worded it as follows: "... the purpose of [information] hiding is to make inaccessible certain details that should not affect other parts of a system."
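Parnas' point is easy to show in a few lines of code. The sketch below is my own minimal, hypothetical illustration, not something taken from his article: callers depend only on a small interface, so the stored representation behind it can change without any caller having to change.

```python
class CustomerStore:
    """Callers see only this interface; how the data is stored is hidden."""

    def __init__(self):
        # Internal representation: today a dict keyed by customer id.
        self._by_id = {}

    def add(self, customer_id, name):
        self._by_id[customer_id] = name

    def name_of(self, customer_id):
        return self._by_id[customer_id]

# If the internal dict is later replaced by a sorted list, a database
# table, or a remote service, code like this keeps working unchanged:
store = CustomerStore()
store.add(1, "Acme")
print(store.name_of(1))  # -> Acme
```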

Information hiding eventually became the basis for popular concepts such as object orientation, component-based development, and, more recently, service-oriented architectures. All three have encapsulation and abstraction as their foundation. No one questions the value of these three concepts anymore.

But Parnas was not the only one who saw the value of encapsulation and abstraction. The most influential paper in the history of data management, "A Relational Model of Data for Large Shared Data Banks," written by E.F. Codd, the founder of the relational model, starts as follows: "Future users of large data banks must be protected from having to know how the data is organized [...] application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed." He used different terms than Parnas did, but he clearly had the same vision. This illustrates a fundamental principle of computer science that lies at the root of data virtualization: applications should be independent of the complexities of accessing data.
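In relational terms this vision became known as data independence, and the view is its everyday workhorse. Here is a minimal sketch (all table and column names are invented): halfway through, the internal representation is restructured and the view is redefined, yet the application keeps issuing exactly the same query.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# The original internal representation: one flat table.
con.execute("CREATE TABLE orders_v1 (id INTEGER, customer TEXT, amount REAL)")
con.execute("INSERT INTO orders_v1 VALUES (1, 'Acme', 100.0)")
con.execute("CREATE VIEW orders AS SELECT id, customer, amount FROM orders_v1")

# The application only ever talks to the view.
app_query = "SELECT customer, amount FROM orders"
print(con.execute(app_query).fetchall())  # -> [('Acme', 100.0)]

# The internal representation changes: customers move to their own table.
# The view is redefined, and the application query survives untouched.
con.execute("DROP VIEW orders")
con.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
con.execute("INSERT INTO customers VALUES (10, 'Acme')")
con.execute("CREATE TABLE orders_v2 (id INTEGER, customer_id INTEGER, amount REAL)")
con.execute("INSERT INTO orders_v2 VALUES (1, 10, 100.0)")
con.execute("""
    CREATE VIEW orders AS
    SELECT o.id, c.name AS customer, o.amount
    FROM orders_v2 o JOIN customers c ON c.id = o.customer_id
""")
print(con.execute(app_query).fetchall())  # -> [('Acme', 100.0)], same query
```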

At the end of the 1970s, a concept called the three-schema approach was introduced and thoroughly researched. G.M. (Sjir) Nijssen was one of the driving forces behind most of the research in this area and wrote numerous articles on the topic. Again, abstraction and encapsulation were the guiding principles.

A personal note: In 1979 I started my IT career working for Nijssen in Brussels, Belgium, when all that research was going on. I didn't realize it at that time, but obviously data virtualization has played a role in my career from day one.

Technologically, data virtualization owes a lot to distributed database technology and federation servers. Most of the initial research for data federation was done by IBM in their famous System R* project, which started way back in 1979. Another project that contributed heavily to distributed queries was the Ingres project, which eventually led to the open source SQL database server called Ingres, now distributed by Actian Corporation. System R* was a follow-up project to IBM's System R project--the birthplace of SQL. Eventually, System R led to the development of most of IBM's commercial SQL database servers, including SQL/DS and DB2.

The forerunners of data virtualization servers cannot be omitted here. The first products that deserve the label data federation server are IBM's DataJoiner and Information Builders' EDA/SQL (Enterprise Data Access). The former was introduced in the early 1990s and the latter in 1991. Neither was a database server; both were primarily products for integrating data from different data sources. Besides being able to access most SQL database servers, they were the first products to provide a SQL interface to non-SQL databases. Both products have matured and have undergone several name changes. After being part of IBM DB2 Information Integrator, DataJoiner is currently called IBM InfoSphere Federation Server, and EDA/SQL has been renamed iWay Data Hub and is part of Information Builders' Enterprise Information Integration Suite.
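The core trick of these products--offering one SQL interface over sources that don't speak SQL at all--can be mimicked in a toy fashion. The following sketch is purely illustrative (the JSON source and its fields are made up, and real federation servers use wrappers, push-down, and query optimizers rather than copying): it pulls records from a JSON document, a stand-in for a non-SQL source, into an in-memory table so that they can be queried with plain SQL.

```python
import json
import sqlite3

# A "non-SQL" source: records living in a JSON document.
json_source = '[{"id": 1, "region": "EMEA"}, {"id": 2, "region": "APAC"}]'

# A toy federation step: expose the JSON records through a SQL interface
# by materializing them into an in-memory table at query time.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE regions (id INTEGER, region TEXT)")
con.executemany(
    "INSERT INTO regions VALUES (?, ?)",
    [(r["id"], r["region"]) for r in json.loads(json_source)],
)

# From here on, the non-SQL source is just another table to query.
print(con.execute("SELECT region FROM regions WHERE id = 2").fetchone())
```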

I could list more technologies, research projects, and products that have been fundamental to the development of data virtualization, but I will stop here. This already impressive list clearly shows the long history and the serious amount of research that has gone into data virtualization and its forerunners. So, maybe the term data virtualization is young, but the technology definitely isn't. Therefore, classifying data virtualization as young and immature would not be accurate.

Note: If you have questions related to data virtualization, send them in. I am more than happy to answer them.


Posted November 12, 2012 7:50 AM
Permalink | No Comments