
Rick van der Lans

Welcome to my blog where I will talk about a variety of topics related to data warehousing, business intelligence, application integration, and database technology. Currently my special interests include data virtualization, NoSQL technology, and service-oriented architectures. If there are any topics you'd like me to address, send them to me at rick@r20.nl.

About the author

Rick is an independent consultant, speaker and author, specializing in data warehousing, business intelligence, database technology and data virtualization. He is managing director and founder of R20/Consultancy. An internationally acclaimed speaker who has lectured worldwide for the last 25 years, he is the chairman of the successful annual European Enterprise Data and Business Intelligence Conference held in London. In the summer of 2012 he published his new book Data Virtualization for Business Intelligence Systems. He is also the author of one of the most successful books on SQL, the popular Introduction to SQL, which is available in English, Chinese, Dutch, Italian and German. He has written many white papers for various software vendors. Rick can be contacted by sending an email to rick@r20.nl.

Editor's Note: Rick's blog and more articles can be accessed through his BeyeNETWORK Expert Channel.

September 2012 Archives

In a series of blogs I answered questions on data virtualization from a particular organization. The following question does not come from them, but since I've heard it asked so many times, I decided to include it in this series.

Question: "If we adopt data virtualization, can we throw away the data warehouse, because we can access the data in the production databases straight on, right?"

Wrong! Data virtualization is not some data warehouse killer. In most projects where data virtualization is deployed, you will still need a data warehouse. In many systems, if no data warehouse is developed, it won't be possible to meet the information needs of many reports. Let me give the two key reasons:

  • Most production systems do not contain historical data. They were not designed to keep track of historical data. If a value is changed, the old value is deleted. For reports that need to do trend analysis, those deleted values may be needed. Thus, those values have to be stored somewhere. And this is where the data warehouse comes in: data warehouses are needed to store historical data.
  • Production systems may contain inconsistent data. One system may say that a customer is based in New York, while the other system indicates that he is based in Boston. Inconsistencies can't always be resolved by software; sometimes human intervention is required to indicate what the correct value is. The result of that intervention must be stored somewhere, so that it can be reused. Again, that's where a data warehouse comes in.
And there are more reasons why this additional database, the data warehouse, is needed. If the data warehouse did not exist and the data virtualization server were connected directly to the production systems, it could not retrieve the historical data, because that data would no longer exist, and it would have no way to determine which of the inconsistent values is the right one.
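
To make the first reason concrete, here is a minimal sketch (using SQLite, with invented table names, purely for illustration) of the difference between a production table that overwrites values and a warehouse-style history table that preserves them:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Production table: one current row per customer; updates overwrite.
    CREATE TABLE customer (id INTEGER PRIMARY KEY, city TEXT);

    -- Warehouse-style history table: old versions are appended, never deleted.
    CREATE TABLE customer_history (id INTEGER, city TEXT, valid_until TEXT);
""")

def update_city(cust_id, new_city, change_date):
    """Apply the change to production, but first preserve the old value."""
    con.execute(
        "INSERT INTO customer_history SELECT id, city, ? FROM customer WHERE id = ?",
        (change_date, cust_id),
    )
    con.execute("UPDATE customer SET city = ? WHERE id = ?", (new_city, cust_id))

con.execute("INSERT INTO customer VALUES (1, 'New York')")
update_city(1, 'Boston', '2012-09-01')

# The production table only knows the current value ...
print(con.execute("SELECT city FROM customer WHERE id = 1").fetchall())
# ... while the history table still holds the overwritten one.
print(con.execute("SELECT city, valid_until FROM customer_history").fetchall())
```

A data virtualization server pointed only at the production table would never see the New York value again; only the history table, i.e. the data warehouse, can serve the trend report.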

It's worth mentioning that if a data warehouse system contains a data warehouse and deploys data virtualization, (physical) data marts may no longer be needed when they contain only data derived from the data warehouse. Such data marts can be simulated by the data virtualization server; we usually refer to them as virtual data marts.
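
A plain SQL view is enough to show the principle of a virtual data mart (real data virtualization servers offer far richer virtual tables, but the idea is the same): the mart is just a definition on top of the warehouse, with no copied data. A hypothetical sketch, with all names invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- The data warehouse table holds the integrated, historical data.
    CREATE TABLE dw_sales (region TEXT, year INTEGER, amount REAL);
    INSERT INTO dw_sales VALUES ('East', 2011, 100.0), ('East', 2012, 120.0),
                                ('West', 2011, 80.0);

    -- A physical data mart would copy this aggregate into its own database.
    -- A virtual data mart is just a definition; the data stays in the warehouse.
    CREATE VIEW mart_sales_per_region AS
        SELECT region, SUM(amount) AS total_amount
        FROM dw_sales
        GROUP BY region;
""")

-- = None  # (ignore) --
# Reports query the virtual mart exactly as they would a physical one.
print(con.execute("SELECT * FROM mart_sales_per_region").fetchall())
```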

So, introducing data virtualization in a data warehouse system does not imply throwing away the data warehouse. The data warehouse is still needed.

Note: For more information on data virtualization, I refer to my new book "Data Virtualization for Business Intelligence Systems" available from Amazon.


Posted September 27, 2012 12:04 PM
In their 2011 study The Agile BI Benchmark, the Aberdeen Group showed that 43% of enterprises indicated that making timely decisions is becoming more difficult. Managers increasingly find they have less time to make decisions after certain business events occur. This is an important observation for every BI department, because it implies that it must be possible to change existing reports faster and to develop new reports more quickly.

The business need to improve development speed and increase agility is probably the reason why concepts such as agile BI and self-service BI have received so much attention lately. The Data Warehousing Institute recently identified five factors driving businesses toward self-service BI (2011 TDWI BI Benchmark Report: Organizational and Performance Metrics for Business Intelligence Teams), of which factors 1, 2, and 4 relate directly to productivity and agility:

  1. Constantly changing business needs (65%)
  2. IT's inability to satisfy new requests in a timely manner (57%)
  3. The need to be a more analytics-driven organization (54%)
  4. Slow and untimely access to information (47%)
  5. Business user dissatisfaction with IT-delivered BI capabilities (34%)
Many vendors of data virtualization servers claim that deploying their products leads to increased productivity and improved agility in BI projects. But how? I will give the two dominant reasons.

First, with data virtualization, reports are decoupled from the data sources. The effect is that it becomes easier to change the data storage solution without having to change the reports (or vice versa). For example, a generic SQL database server can be replaced by a faster analytical database server without having to change one line of code in the reports. Another example where decoupling is useful is when a reporting query is redirected away from a database that contains derived data to one that contains the original data. This means that the derived data can be removed, thus simplifying the entire system. Such a change is transparent to the reports. Simplification is (almost) always good for productivity and agility.

Second, if users need access to data they haven't used before, making that data available to them using data virtualization is usually very easy, even if it concerns a NoSQL product or unstructured data. That first implementation may be slow, but it works; users can deploy it and may even see an ROI. In parallel with the use of this first implementation, the IT department can look for a more efficient implementation (if required). Again, making changes afterwards has no impact on the reports.
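
The decoupling idea behind both reasons can be illustrated with a small sketch. All names below are invented, and SQLite stands in for both the generic and the analytical database server; the point is only that the report code queries a logical table, while the mapping to a physical source lives in one central place:

```python
import sqlite3

def generic_source():
    # Stand-in for a generic SQL database server.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    con.execute("INSERT INTO sales VALUES ('East', 100.0)")
    return con

def analytical_source():
    # Stand-in for a faster analytical server holding the same data.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    con.execute("INSERT INTO sales VALUES ('East', 100.0)")
    return con

SOURCES = {"sales": generic_source}  # the current mapping

def query(logical_table, sql):
    """The only place that knows which physical source backs a logical table."""
    return SOURCES[logical_table]().execute(sql).fetchall()

# Report code: written once, never changes.
print(query("sales", "SELECT SUM(amount) FROM sales"))

# Swap the back end; the report line above runs unchanged.
SOURCES["sales"] = analytical_source
print(query("sales", "SELECT SUM(amount) FROM sales"))
```

In a real data virtualization server this mapping is, of course, maintained declaratively in the virtual tables, not in application code.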

More reasons exist that explain why data virtualization makes BI systems more agile and productive, but I think these two are the main ones. For more on how DV increases productivity and agility, I refer to the webinar series entitled Agile Data Integration for BI Lecture Series.


Posted September 25, 2012 2:49 AM
I still hear people say that data virtualization is just a new name for data federation--old wine in new bottles. Not true! If we look at the current generation of data virtualization servers, we have to conclude that they offer a lot more functionality than data federation servers.

Data federation is run-time technology that makes it easy for an application to access a heterogeneous set of data stores. The data federation server deals with all the different APIs and database languages, tries to optimize access to those data stores through distributed join optimization, and handles all the issues of distributed transactions.

In my book on data virtualization (Data Virtualization for Business Intelligence Systems), I define data federation as follows:

Data federation is an aspect of data virtualization where the data stored in a heterogeneous set of autonomous data stores is made accessible to data consumers as one integrated data store by using on-demand data integration.
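
As a toy illustration of that definition, the sketch below joins data from two autonomous stores at query time. Two in-memory SQLite databases stand in for, say, a CRM system and a billing system, and all names are invented:

```python
import sqlite3

# Two autonomous data stores, each with its own data.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customer (id INTEGER, name TEXT)")
crm.execute("INSERT INTO customer VALUES (1, 'Jones')")

billing = sqlite3.connect(":memory:")
billing.execute("CREATE TABLE invoice (customer_id INTEGER, amount REAL)")
billing.execute("INSERT INTO invoice VALUES (1, 250.0)")

def federated_customer_invoices():
    """Fetch from both stores at query time and join in the federation layer."""
    customers = {cid: name for cid, name in crm.execute("SELECT id, name FROM customer")}
    return [
        (customers[cid], amount)
        for cid, amount in billing.execute("SELECT customer_id, amount FROM invoice")
        if cid in customers
    ]

# The data consumer sees one integrated result, not two data stores.
print(federated_customer_invoices())  # [('Jones', 250.0)]
```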

Data virtualization is much more than data federation. Here are some of the features supported by data virtualization servers today:

  • Self-service, iterative, and collaborative development
  • (Canonical) data modeling
  • On-demand data profiling and data cleansing
  • Full support for the entire development life cycle: business glossary, information modeling
  • Extensive data integrity features
  • Extensive master data management features
  • Integration of different data integration styles, including ETL, ELT, and replication
In a nutshell, where data federation is primarily run-time technology, data virtualization supports the entire system development life cycle. So, it supports modeling and design as well. Or, to use popular terminology, data virtualization = data federation++.

If you want to know more about this topic, attend my session at The Data Virtualization Experts Forum.


Posted September 21, 2012 1:41 AM
In a series of blogs I am answering some of the questions a large US-based health care organization had on data virtualization. I decided to share some of their questions with you, because some of them represent issues that many organizations struggle with.

Their question: "There is a theory being proposed within the organization that we should just replicate tables from all systems into a common database, then define views on top and we are home. Are we now oversimplifying the whole field of data integration?"

This question can be generalized to one I get very often: "Every SQL database server supports views and they are like virtual tables in data virtualization servers, so what's the difference?" At a very high abstraction level, and if we leave out hundreds of (important) details, the two may look identical, but the differences are numerous. Here are a few features offered by data virtualization servers that are not normally found in the views of SQL database servers:

  • In the definitions of virtual tables, advanced cleansing operations can be invoked, which is impossible or hard to do in most SQL views.
  • Virtual tables can point to a wide range of data sources, including NoSQL data sources, HTML-based websites, and unstructured data sources. In many SQL database servers, views can only be defined on SQL tables or flat files.
  • Data virtualization servers support designer-friendly user interfaces for designing virtual tables. This does not apply to most SQL database servers.
  • Advanced caching mechanisms exist to store the contents of the virtual tables. This is comparable to materializing views. However, caches in data virtualization servers offer more advanced features for refreshing the contents.
  • Data virtualization servers offer on-demand data profiling capabilities to study the quality of the content of virtual tables.
  • Data virtualization servers support lineage and impact analysis to determine the relationships between all the virtual tables and the data sources. This is important for development and maintenance.
  • Virtual tables can be published in different technical forms, such as SQL tables, SOAP-based services, REST-based services, MDX tables, and so on. In most SQL systems, views can only be presented as SQL tables.

Hopefully, this convinces most people of the differences between the two. I heard someone once summarize the differences as follows: "Virtual tables are like SQL views on steroids." I agree.
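
The first bullet is perhaps the easiest one to illustrate. In the hypothetical sketch below (SQLite as the data source, all names invented), a "virtual table" applies a cleansing operation written in arbitrary code while the rows are being read, something a plain SQL view in most products cannot express:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (id INTEGER, city TEXT)")
con.executemany("INSERT INTO customer VALUES (?, ?)",
                [(1, ' new york'), (2, 'NEW YORK'), (3, 'Boston ')])

def cleanse_city(raw):
    """A cleansing operation a virtual table could invoke: arbitrary code,
    not limited to what the SQL dialect of the underlying store offers."""
    return raw.strip().title()

def virtual_customer():
    """A 'virtual table': rows are cleansed on the fly as they are read."""
    for cid, city in con.execute("SELECT id, city FROM customer"):
        yield (cid, cleanse_city(city))

print(list(virtual_customer()))
# [(1, 'New York'), (2, 'New York'), (3, 'Boston')]
```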

Note: For more information on data virtualization, I refer to my new book "Data Virtualization for Business Intelligence Systems" available from Amazon.


Posted September 18, 2012 12:02 AM
In a series of blogs I am answering some of the questions a large US-based health care organization had on data virtualization. I decided to share some of their questions with you, because some of them represent issues that many organizations struggle with.

One question they had was related to their development approach: "Would it make sense to design the topmost layer of our virtual tables as close as possible to what the reports need? Then, create tables in Microsoft Access with the same table structure as those virtual tables and let the report programmers start building reports against these Microsoft Access tables. Meanwhile, the core team designs the data virtualization model (mapping the virtual tables to the real data sources). This way the two activities (1) programming of reports and (2) building of the virtualization model can be executed in parallel. In two months, when both activities are complete, they meet, and we switch the reports to point to the virtual tables rather than to the Microsoft Access tables. Is this a sound approach, or are we stretching it too far?"

My direct response was: "No, you're not stretching it at all; I think you're getting it." What I added was that the only issue might be that minor differences can exist between the SQL dialect of the data virtualization server and that of Microsoft Access.

Their response: "Good point. We think this can be overcome. We can create the Microsoft Access tables. Then use that as a data source for the data virtualization server, create virtual tables that correspond 1:1 to the tables in Access. Next, have programmers code reports against those virtual tables that point to the Access tables. Finally, when the reports are ready, we redirect the virtual tables to the real data sources."

Indeed, this is the preferred approach, because it means that the reports always access the same virtual tables, even when the switch is made from Microsoft Access to the real data source. Redirecting the virtual tables is completely transparent to the reports, and they will run unchanged. In addition, redirecting involves almost no work at all.
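
The mechanics of that redirect can be sketched with plain SQL views standing in for real virtual tables (all names below are invented). The report's query never changes; only the definition of the virtual table does:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Stand-in ('Access-style') stub table with the agreed structure.
    CREATE TABLE stub_orders (id INTEGER, amount REAL);
    INSERT INTO stub_orders VALUES (1, 9.99);          -- dummy test data

    -- The virtual table the report programmers code against.
    CREATE VIEW v_orders AS SELECT id, amount FROM stub_orders;
""")

REPORT_QUERY = "SELECT SUM(amount) FROM v_orders"      # the report never changes
print(con.execute(REPORT_QUERY).fetchall())            # runs against the stub

# Two months later: the real source is ready; redirect the virtual table.
con.executescript("""
    CREATE TABLE real_orders (id INTEGER, amount REAL);
    INSERT INTO real_orders VALUES (1, 100.0), (2, 250.0);

    DROP VIEW v_orders;
    CREATE VIEW v_orders AS SELECT id, amount FROM real_orders;
""")
print(con.execute(REPORT_QUERY).fetchall())            # same query, real data
```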

Just to be clear, I am not recommending Microsoft Access as the preferred platform for developing virtual tables, but I like how this customer is thinking about how to use the power of data virtualization servers to come up with a very efficient and agile development approach. Because data virtualization servers decouple the real data sources from the reports, changes (even drastic ones) can be made to the data sources without having to change the reports, or vice versa. In fact, this is why it's often said that data virtualization makes business intelligence systems more agile.

Note: For more information on data virtualization, I refer to my new book "Data Virtualization for Business Intelligence Systems" available from Amazon.


Posted September 14, 2012 2:09 AM