
Blog: Rick van der Lans

Rick van der Lans

Welcome to my blog where I will talk about a variety of topics related to data warehousing, business intelligence, application integration, and database technology. Currently my special interests include data virtualization, NoSQL technology, and service-oriented architectures. If there are any topics you'd like me to address, send them to me at rick@r20.nl.

About the author

Rick is an independent consultant, speaker and author, specializing in data warehousing, business intelligence, database technology and data virtualization. He is managing director and founder of R20/Consultancy. An internationally acclaimed speaker who has lectured worldwide for the last 25 years, he is the chairman of the successful European Enterprise Data and Business Intelligence Conference, held annually in London. In the summer of 2012 he published his new book Data Virtualization for Business Intelligence Systems. He is also the author of one of the most successful books on SQL, the popular Introduction to SQL, which is available in English, Chinese, Dutch, Italian and German. He has written many white papers for various software vendors. Rick can be contacted by sending an email to rick@r20.nl.

Editor's Note: Rick's blog and more articles can be accessed through his BeyeNETWORK Expert Channel.

September 2012 Archives

I am happy to announce that my book on data virtualization is available. It was a lot of hard work, but I am proud of the final result.

Early on in the writing process I decided to go for the business intelligence angle and not for a generic book on data virtualization. This explains the title Data Virtualization for Business Intelligence Systems. It allowed me to dive deep with respect to BI-related topics, such as data quality, data integration, effects on data marts, and the impact on data profiling.

Writing a book requires a lot of studying. It means you have to seriously structure and order your knowledge of a topic. Some of that studying leads to great insights. I got some very useful insights when studying for the chapter on design guidelines for data virtualization. In the IT industry many new technologies are introduced every year. Just think about big data, NoSQL, cloud, and so on. But what strikes me is that most of these new technologies are introduced without giving customers any design guidelines. It's up to them to use a trial-and-error approach to find the proper guidelines, which is an expensive and time-consuming way of discovering how best to use new technology.

Therefore I decided to include a chapter in the book on design guidelines when using data virtualization servers. Examples of guidelines are:

  • How to handle incorrect data.
  • Dealing with different users using different definitions for the same concepts.
  • Retrieving data from production systems and the potential interference that results from that.
Hopefully, this chapter will trigger others to come up with more and possibly even better guidelines. I think it's important that when new technology is introduced, guidelines exist.

It was fun to write this book, and I hope it will help to introduce data virtualization in the BI industry, because this technology deserves more attention.

Posted September 10, 2012 12:35 AM

In a series of blogs I am answering some of the questions a large, US-based health care organization had on data virtualization. I decided to share some of their questions with you, because some of them represent issues that many organizations struggle with.

One question they had was: "Our reporting tool supports data integration features, any thoughts on whether these features are in the same league as Composite, Informatica, Denodo?"

This is an understandable question, because many reporting and analytical tools come with their own built-in data integration capabilities. For example, BusinessObjects, Microsoft PowerPivot, and QlikView all allow users to enter data integration specifications. So, why buy a separate data virtualization server if you (think you) have that kind of functionality already in place?

There are various reasons why data virtualization servers are valuable:

  • All the data integration specifications entered in a reporting tool can only be used by that particular tool (or by tools of the same vendor). So, if an organization deploys different reporting and analytical tools (and many do), for example SAS, Excel, and Cognos, the data integration specifications have to be replicated in all of them. Keeping them consistent across all the tools is quite a challenge. With a data virtualization server, these specifications have to be entered only once and can be shared by all the reporting tools. This leads to more consistent reporting results, increases productivity, and simplifies maintenance.

  • Besides features for data federation, data cleansing, and data transformation, data virtualization servers offer a lot more functionality. Most of them support on-demand data profiling capabilities, invocation of data cleansing operations, master data management, special user interfaces for less technical business analysts, modules for lineage and impact analysis, and so on. It's not just data federation and data transformation anymore. Data virtualization servers support comprehensive design and development environments.

  • The technology and techniques for extracting data from data sources are usually more powerful in data virtualization servers than in reporting tools. This is not so strange, because this is the core functionality of data virtualization servers, while extracting and transforming data is not the core functionality of reporting tools. For example, data virtualization servers support advanced query optimization techniques, such as query expansion, query substitution, and ship joins, that are not supported by the data integration capabilities of reporting tools. They also have sophisticated caching mechanisms to improve performance or to offload query processing, powerful data protection features, and so on.

  • Most data virtualization servers support a wider range of data sources than reporting tools do. They can even extract data from NoSQL data sources, HTML-based websites, and unstructured data sources.
In other words, there are numerous reasons for deploying data virtualization servers in business intelligence systems, even if the reporting tools indeed support some of that functionality. I am not saying that the data integration capabilities of reporting tools are immature, but data virtualization servers are designed for on-demand data integration, whereas the strength of reporting tools is processing and presenting data.

To me, the most important reason on the list above is the first one: data virtualization servers allow for the centralization of data integration specifications. These data integration specifications are extremely valuable to an organization and should not be distributed and replicated all over the place. I assume every data governance and information management specialist will agree with me on this one.
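To make the centralization argument a bit more concrete, here is a minimal Python sketch. It is not how any of the products mentioned above work internally. The two in-memory sources, the function names, and the field names are all hypothetical and only serve to show one shared integration specification being consumed by two different "reporting tools".

```python
from collections import defaultdict

# Two hypothetical sources with different shapes and naming conventions.
CRM_CUSTOMERS = [
    {"cust_id": 1, "name": "Acme"},
    {"cust_id": 2, "name": "Globex"},
]
BILLING_INVOICES = [
    {"customer": 1, "amount_cents": 12500},
    {"customer": 1, "amount_cents": 7500},
    {"customer": 2, "amount_cents": 30000},
]


def customer_revenue():
    """The single, shared integration specification: join, convert, aggregate."""
    totals = defaultdict(int)
    for invoice in BILLING_INVOICES:
        totals[invoice["customer"]] += invoice["amount_cents"]
    return [
        {"customer": c["name"], "revenue": totals[c["cust_id"]] / 100}
        for c in CRM_CUSTOMERS
    ]


def spreadsheet_report():
    """Stand-in for one reporting tool: plain rows of (customer, revenue)."""
    return [(row["customer"], row["revenue"]) for row in customer_revenue()]


def dashboard_report():
    """Stand-in for a different tool: the top customer by revenue."""
    return max(customer_revenue(), key=lambda row: row["revenue"])["customer"]


print(spreadsheet_report())
print(dashboard_report())
```

Both reports rely on the same `customer_revenue` definition, so a change to the join or to the cents-to-currency conversion has to be made in one place only. If each tool carried its own copy of that logic, the copies could drift apart.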

Note: For more information on data virtualization, I refer to my new book "Data Virtualization for Business Intelligence Systems" available from Amazon.

Posted September 7, 2012 12:17 AM

As indicated in my previous blog, the last few weeks I've been talking to a large health care organization based in the US that is considering introducing data virtualization into their business intelligence system. Some of the questions they had on data virtualization represent issues that many organizations struggle with today. Therefore, I decided to share a few of their questions with you.

A question they had was: "If I had to pick a platform to build a solution on, among the vendors, Composite, Denodo, and Informatica, knowing what I know right now (August 2012) about these products, which one would I pick?"

I get this question quite regularly. When I do a seminar or present a keynote at a conference, afterwards there is always at least one delegate with a similar question: "What do you think is the best product?" As everyone can imagine, that question can only be answered with the remark "It depends."

The data virtualization servers of Composite, Denodo, and Informatica are highly competitive products. They definitely all have great capabilities for data integration, data federation, and data modeling. Still, there are differences--they all have their strengths and weaknesses. One is stronger with respect to accessing unstructured and semi-structured data sources, the second is really good with support for ETL-like functionality, and the third has excellent support for data cleansing and data profiling. Currently, together they form the three top products in the world of data virtualization servers. Comparing them is really like comparing apples with apples.

I always recommend that customers do a proof of concept (PoC) to find the product that's "best" for them. The additional advantage of doing a PoC is that it becomes much clearer to organizations what data virtualization could mean for them and how it will (positively) change their BI projects, and it will open their eyes to opportunities.

Therefore, as is to be expected, my answer to this organization was: "It depends, do a PoC." I hope they will be doing one within a few weeks, and I hope many others will do the same in the near future, so that they see how data virtualization will help make their BI systems more agile.


Posted September 5, 2012 10:45 AM

The last few weeks I've been talking to a large health care organization based in the US. They are considering introducing data virtualization into their business intelligence system. Some of their questions on data virtualization led to interesting discussions and insights. My feeling is that some of their questions represent issues that many organizations struggle with today. Therefore, I decided to share a few of their questions and my answers with you.

This was one of their first questions: "Is it accurate to think that in this new paradigm of data virtualization, reporting programs only have to deal with the user interface logic, whereas the data virtualization server does most of the business logic and data manipulation/data integration/data aggregation work? Consequently, does that mean that the task of programming the reports is easier than in a traditional report where the report program contains all the logic (UI + business + data integration)?"

To me this is a great question. Let me share my answer with you.

More and more reporting and analytical tools support features for data federation, data aggregation, data manipulation, and data cleansing (for simplicity's sake, let's refer to these features with the term data integration). Technically this means that after the tools are hooked up to the required data sources, data integration specifications can be entered to turn all the data into any form, and finally, the development of reports can be started.

This approach has disadvantages. The first one is that it becomes hard to guarantee that all users deploy the same data integration specifications. How do we enforce this? If they all use the same tool, maybe the tool offers features to share those data integration specifications--note that some tools do and some don't. However, not all users use the same tools, which can lead to inconsistent reporting results.

The second disadvantage relates to whether users are aware of all the intricacies of the data sources they access. Imagine that one of the data sources is an old production database. In that database some tricky structures are used. For example, if column A contains the code 1, then the value X in column B means New York, but if the code in column A is equal to 2, then the X in column B means New Jersey. Users have to be aware of all that data-related logic when doing their own data integration work.

For many situations it's recommended to enter data integration specifications only once in a centralized system and let all reports share them. These specifications specify all the necessary data federation, data cleansing, data transformation, and data integration work. In other words, together they hide the intricacies of all the data sources. This approach leads to more consistent reporting.
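The column A and column B example above can be sketched in a few lines of Python using the standard sqlite3 module. This is only an illustration under stated assumptions: the table name `legacy_orders`, the column names `col_a` and `col_b`, the view name `orders_decoded`, and the sample rows are all made up, and a plain database view merely stands in for a virtual table in a data virtualization server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical legacy table: the meaning of col_b depends on col_a.
    CREATE TABLE legacy_orders (order_id INTEGER, col_a INTEGER, col_b TEXT);
    INSERT INTO legacy_orders VALUES (1, 1, 'X'), (2, 2, 'X'), (3, 1, 'Y');

    -- The decoding rule is written once, in a view that every report queries.
    CREATE VIEW orders_decoded AS
    SELECT order_id,
           CASE
               WHEN col_a = 1 AND col_b = 'X' THEN 'New York'
               WHEN col_a = 2 AND col_b = 'X' THEN 'New Jersey'
               ELSE 'Unknown'
           END AS region
    FROM legacy_orders;
""")


def regions_by_order(connection):
    """Return {order_id: region} as a report would see it through the view."""
    rows = connection.execute(
        "SELECT order_id, region FROM orders_decoded ORDER BY order_id"
    )
    return dict(rows.fetchall())


decoded = regions_by_order(conn)
print(decoded)  # {1: 'New York', 2: 'New Jersey', 3: 'Unknown'}
```

A report that reads from `orders_decoded` never has to know about the dependency between the two columns. Only the person who maintains the view does.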

This is where data virtualization servers come in. With data virtualization servers, such as those from Composite, Denodo, and Informatica, all those data integration specifications can be entered in a more centralized way and are shared by all reports even in a heterogeneous reporting environment.

To come back to the question, the effect is that users can fully focus on the reporting and analysis of data (UI), plus they don't have to spend time on data integration. So yes, the task of programming the reports becomes easier than in a traditional report where the report program contains all the logic (UI + business + data integration).


Posted September 4, 2012 7:07 AM
