Blog: Rick van der Lans

Rick van der Lans

Welcome to my blog where I will talk about a variety of topics related to data warehousing, business intelligence, application integration and database technology. Currently my special interests include virtual data warehousing, mashups and service-oriented architectures. If there are any topics you'd like me to address, send them to me at rick@r20.nl

About the author >

Rick is an independent consultant, speaker and author, specializing in data warehousing, business intelligence (BI), application integration and database technology. He is managing director and founder of R20/Consultancy. He is an internationally acclaimed speaker who has lectured worldwide for the last 25 years. He is the chairman of the successful annual European Business Intelligence and Data Warehouse Conference held in London and the annual BI event in The Netherlands. Currently, he is promoting a new architecture for data warehousing called the Data Delivery Platform. He is the author of several books on computing, including the popular Introduction to SQL. Some of his books are available in English, Chinese, Dutch, Italian and German. He is also the author of the successful books SQL for MySQL Developers and The SQL Guide to Ingres. Rick may be contacted by sending an email to info@r20.nl.

Editor's Note: More articles and resources are available in Rick van der Lans' BeyeNETWORK Expert Channel. Be sure to visit today!

On October 6, 2011, Informatica organized a virtual conference on data virtualization; see http://bit.ly/puyGZ6. During the live event, attendees took part in various polls and competitions. In one competition they were asked to describe what they would choose as their first project for using data virtualization in their organization. As the judge for the competitions, I was fascinated by the results that came in. In a nutshell, these attendees were spot on: they were picking the right projects, they had the right arguments for selecting those projects, and they clearly understood the full potential of data virtualization.

 

Here are some of the statements they made - paraphrased to respect sensitive customer information:

 

  • "With data virtualization, analyst teams can gain quick access to data, profile data, and develop prototypes together with the business to finalize requirements before development."
  • "We are starting to bring in a host of external data to supplement our internal data warehouse. There is often a fair amount of uncertainty of the utility of the information upfront until some amount of BI exploration is complete. To improve the time to market, I would consider tools like data virtualization to be able to provide that information quickly, and hold off on any real data warehouse integration until utility of the data is proven."
  • "We have a data warehouse that lags data by a day or more for exposing key data to our customers. By speeding up the delivery of information to our customers, they can respond more quickly..."

 

These quotes show that data virtualization is being applied for a wide range of purposes. The keywords that jump out are: agile, flexibility, prototyping, real-time data, and fast response.

 

To summarize, data virtualization has reached maturity: organizations are deploying the technology, and they know why and where they want to use it.

 

The live poll results reflect this as well. In a poll that asked the question, "What are the use cases you believe data virtualization is applicable in your environment," the majority picked enabling agile BI as their first choice, followed by single view and data services for SOA.

 


Posted October 17, 2011 10:15 AM

More and more often, Apache's Hadoop is compared to relational databases. In most of those comparisons, Hadoop is presented as a non-relational database, as something that's totally different from classic database servers, such as IBM's DB2, Microsoft SQL Server, and Oracle 11g. Comparing Hadoop this way makes no sense. Hadoop can be as relational as those classic database servers.

 

Whether a system is relational does not depend on how data is stored on disk, but fully depends on how the data is perceived by the applications. It depends on what language and/or API the applications use to insert, query, and manipulate the data.

 

In a nutshell, when the relational model was defined and introduced by Ted Codd, Chris Date and others, the rule was that if a system could present all the data as tables and columns, and if that data could be accessed through a language supporting relational operators such as join, select, and project, that system was a relational system. Ted Codd called this data independence: an application should not be concerned with how the data is physically stored.

 

What this means is that if a system offers an interface where data is presented as tables and that supports those relational operators, it offers a relational interface. For example, if a system supports a SQL interface on a dataset, that system can be classified as relational. Note: I am aware that SQL does not adhere to all the rules needed to offer a relational interface, but for the sake of simplicity I will regard SQL as relational.

 

We have to make a distinction between, on one hand, the storage model and the storage engine, and, on the other hand, the interface the applications use. Let's call the latter the interface model. Whether something can be qualified as relational depends on that interface model and not on the storage model. Whether data is stored as records, in a column-oriented fashion, in a key-value store, or, if possible, in a fish bowl, is irrelevant. The storage model does not determine whether a system is relational or not.
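To make the distinction concrete, here is a minimal Python sketch. Everything in it is hypothetical and made up for illustration (the `RowStore` and `ColumnStore` classes, the `select`, `project`, and `join` helpers, and the sample data); it is not how any real database server works internally. It only shows that two different storage models can sit behind the same relational-style interface, so the application cannot tell them apart.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


class RowStore:
    """Hypothetical storage model 1: one dict per record."""

    def __init__(self, rows: List[Row]) -> None:
        self._rows = [dict(r) for r in rows]

    def scan(self) -> Iterable[Row]:
        for r in self._rows:
            yield dict(r)


class ColumnStore:
    """Hypothetical storage model 2: one list of values per column.

    Assumes every input row has the same column names.
    """

    def __init__(self, rows: List[Row]) -> None:
        self._columns: Dict[str, List[object]] = {}
        for r in rows:
            for name, value in r.items():
                self._columns.setdefault(name, []).append(value)

    def scan(self) -> Iterable[Row]:
        names = list(self._columns)
        # Re-assemble rows from the separate column lists.
        for values in zip(*(self._columns[n] for n in names)):
            yield dict(zip(names, values))


# The interface model: three relational operators over rows.
def select(rows: Iterable[Row], predicate: Callable[[Row], bool]) -> List[Row]:
    return [r for r in rows if predicate(r)]


def project(rows: Iterable[Row], columns: List[str]) -> List[Row]:
    return [{c: r[c] for c in columns} for r in rows]


def join(left: Iterable[Row], right: Iterable[Row], column: str) -> List[Row]:
    right_rows = list(right)
    return [{**l, **r} for l in left for r in right_rows if l[column] == r[column]]


customers = [{"cust_id": 1, "name": "Ann"}, {"cust_id": 2, "name": "Bob"}]
orders = [
    {"order_id": 10, "cust_id": 1, "amount": 250},
    {"order_id": 11, "cust_id": 2, "amount": 40},
    {"order_id": 12, "cust_id": 1, "amount": 90},
]


def big_spenders(customer_store, order_store) -> List[Row]:
    """The 'application': it only sees tables and relational operators."""
    joined = join(customer_store.scan(), order_store.scan(), "cust_id")
    return project(select(joined, lambda r: r["amount"] > 50), ["name", "amount"])


from_rows = big_spenders(RowStore(customers), RowStore(orders))
from_columns = big_spenders(ColumnStore(customers), ColumnStore(orders))
print(from_rows == from_columns)  # True: same answer from both storage models
```

The `big_spenders` function never learns which storage model it is talking to, which is the point of separating the interface model from the storage model.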

 

Hadoop's HDFS uses a very specific storage model and a unique storage engine that are both different from what the classic database servers have implemented. And of course, if we accessed the interface of HDFS directly, we wouldn't see an interface that could be called relational, but a very technical low-level interface instead. However, if we used HiveQL to access the data stored in Hadoop, or if we used a data virtualization server such as Composite Information Server or Informatica Data Services running on top of Hadoop, in both cases the Hadoop database would be accessed in a relational way, meaning it would become a relational system to the applications.

 

This is not very different from accessing classic relational databases. If we access their data via the standard SQL interface, they are relational systems. However, if it were possible to develop an application that accesses the data by directly accessing their internal storage engines, the same data wouldn't look that relational anymore. By the way, those database servers don't always store the data as records either. For example, in Oracle data can be presented as tables while in fact it's stored as a multi-dimensional cube. And in Sybase IQ data is presented as tables but is stored in a column-oriented fashion using pointer structures.

 

To summarize, whether a system is relational does not depend on the storage model, but on the language and/or API used to access the data. The same data set can be presented as relational to one application and as non-relational to another. Hadoop offers a special storage model, but that doesn't mean that data cannot be presented in a relational way. In fact, the same applies to most of those new so-called NoSQL database servers.

 

To come back to the comparisons, it would make sense to compare the storage models of Hadoop with those of other database servers, and it would make sense to compare the interfaces of Hadoop with the interfaces of other database servers. But a comparison of the Hadoop storage model with the interface model of classic database servers is like comparing apples and pears.


Posted September 12, 2011 9:26 AM

Maybe some have missed it, but at the end of last year Informatica entered the market of data virtualization/data federation products with Informatica Data Services (IDS). This product has been built on top of the Informatica 9 platform, from which it inherits its robustness and scalability.

 

Besides all the features you expect from a data virtualization product, it does offer some unique ones. For example, virtual tables (views) are not defined by using SQL or XQuery, but with a flow language that resembles the flow language used in PowerCenter for defining ETL scripts. The only difference is that in PowerCenter the result of a flow is stored in some table or file, while with IDS the result is "pushed" to a reporting or analytics tool. Under the hood, the flow language is transformed into SQL and pushed down to the database servers. It will try to process as much of the data access as close to the data as possible.
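The idea of turning a flow into SQL that is pushed down to the database server can be sketched in a few lines of Python. This is not the IDS or PowerCenter flow language; the `Source`, `Filter`, and `Project` steps and the `compile_flow` function are hypothetical names invented for this illustration, and SQLite stands in for the underlying database server.

```python
import sqlite3
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Source:
    table: str


@dataclass
class Filter:
    condition: str  # SQL boolean expression; trusted input in this sketch


@dataclass
class Project:
    columns: List[str]


Step = Union[Source, Filter, Project]


def compile_flow(steps: List[Step]) -> str:
    """Translate a linear flow into one SELECT statement.

    Deliberately minimal: one source, any number of filters, one projection.
    """
    table = None
    conditions: List[str] = []
    columns = ["*"]
    for step in steps:
        if isinstance(step, Source):
            table = step.table
        elif isinstance(step, Filter):
            conditions.append(step.condition)
        elif isinstance(step, Project):
            columns = step.columns
    if table is None:
        raise ValueError("flow needs a Source step")
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    if conditions:
        sql += " WHERE " + " AND ".join(f"({c})" for c in conditions)
    return sql


flow = [Source("orders"), Filter("amount > 50"), Project(["order_id", "amount"])]
sql = compile_flow(flow)
print(sql)  # SELECT order_id, amount FROM orders WHERE (amount > 50)

# The generated SQL runs inside the database, so filtering happens close to the data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, cust_id INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, 250), (11, 2, 40), (12, 1, 90)],
)
rows = conn.execute(sql).fetchall()
```

Because the filter ends up in the WHERE clause, only the matching rows leave the database, which is the effect described above as processing the data access as close to the data as possible.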

 

Another feature is that data profiling has been implemented as an integrated part of the product, and the profiling can be done in an on-demand style. What that means is that once a virtual table has been defined, the (virtual) contents of the virtual table are profiled at the click of a button. If something looks incorrect, it can be fixed by adding or changing transformations, or by fixing the source data (if allowed and possible). This becomes an iterative process that continues until the virtual table returns the right data.
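A rough sketch of that iterative loop, using a SQLite view as a stand-in for a virtual table. The `profile_view` function, the table and view names, and the sample data are all hypothetical; the real product exposes this through its own tooling, not through this code.

```python
import sqlite3
from typing import Dict


def profile_view(conn: sqlite3.Connection, view: str) -> Dict[str, dict]:
    """Compute simple per-column statistics over the current contents of a view.

    Assumes the non-null values within a column are mutually comparable.
    """
    cursor = conn.execute(f'SELECT * FROM "{view}"')
    names = [d[0] for d in cursor.description]
    rows = cursor.fetchall()
    report = {}
    for i, name in enumerate(names):
        values = [row[i] for row in rows]
        present = [v for v in values if v is not None]
        report[name] = {
            "rows": len(values),
            "nulls": len(values) - len(present),
            "distinct": len(set(present)),
            "min": min(present) if present else None,
            "max": max(present) if present else None,
        }
    return report


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (cust_id INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "NL"), (2, "nl"), (3, None), (4, "UK")],
)

# First definition of the virtual table: no transformations yet.
conn.execute("CREATE VIEW v_customers AS SELECT cust_id, country FROM customers")
report = profile_view(conn, "v_customers")
print(report["country"]["distinct"])  # 3: "NL" and "nl" are counted separately

# The profile exposes inconsistent casing, so a transformation is added.
conn.execute("DROP VIEW v_customers")
conn.execute(
    "CREATE VIEW v_customers AS "
    "SELECT cust_id, UPPER(country) AS country FROM customers"
)
report_after = profile_view(conn, "v_customers")
print(report_after["country"]["distinct"])  # 2
```

Nothing is materialized between the two profiling runs; each run looks at whatever the view returns at that moment.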

 

In addition, the developer can ask a user or business analyst to look at the virtual table as well. The user can check whether the contents are OK, and if not, the user can add his own transformations by using a simple Excel-like language. Eventually, defining the right transformations becomes a collaborative process between users and developers.

 

Complex cleansing operations can also be executed on-demand. In other words, when data is retrieved by a report, IDS will access the underlying data sources and will execute all the cleansing operations.
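As a small illustration of what "on-demand" means here, the following Python sketch applies cleansing functions each time a virtual table is queried, instead of storing a cleansed copy. The `VirtualTable` class and the two cleansing functions are hypothetical and say nothing about how IDS implements this.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]
Cleanser = Callable[[Row], Row]


def trim_name(row: Row) -> Row:
    return {**row, "name": str(row["name"]).strip()}


def normalize_country(row: Row) -> Row:
    country = row["country"]
    cleaned = country.upper() if isinstance(country, str) else "UNKNOWN"
    return {**row, "country": cleaned}


class VirtualTable:
    """Reads the source and cleanses rows at query time; stores nothing itself."""

    def __init__(self, read_source: Callable[[], Iterable[Row]],
                 cleansers: List[Cleanser]) -> None:
        self._read_source = read_source
        self._cleansers = cleansers

    def query(self) -> Iterator[Row]:
        for row in self._read_source():
            for cleanse in self._cleansers:
                row = cleanse(row)
            yield row


source: List[Row] = [{"name": " Ann ", "country": "nl"}]
table = VirtualTable(lambda: list(source), [trim_name, normalize_country])

first = list(table.query())
# A change in the source is visible in the very next query.
source.append({"name": "Bob", "country": None})
second = list(table.query())
```

The trade-off is that every query pays the cost of cleansing, in exchange for results that always reflect the current state of the source.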

 

To summarize, IDS shows how feature-rich and mature the data virtualization products are becoming. If you want to know more about how IDS works and what its features are, get my new technical whitepaper Developing a Data Delivery Platform with Informatica Data Services.

 


Posted April 8, 2011 1:57 AM

Are you interested in speaking at the Data Warehouse & Business Intelligence European Conference in London this coming November? If you are, please fill in the call for papers.

Previous editions were very successful and attracted more than 200 delegates. Evaluations showed that the attendees were very pleased with the selected speakers, the topics, and setup of the conference.

The 2011 edition is aimed at all aspects of data warehousing and business intelligence, including: trends, design guidelines, product overviews and comparisons, best practices, and new evolving technologies. And like the previous years, the conference is organized together with the highly successful European Data Management and Information Quality Conference.

With this year's call for presentations we are trying to attract proposals for sessions on traditional and future data warehousing and business intelligence aspects. Delegates have expressed a preference for the use of case studies rather than theoretical or abstract topics. We would particularly like practitioners in the field to respond to this call for papers. We encourage new speakers to apply. Success stories - case studies where data warehousing and business intelligence have produced real bottom-line benefits are very much appreciated.

Example topics for proposals are:

  • Business and data analytics
  • BI in the cloud
  • Data modelling for data warehouses
  • NoSQL in a data warehouse environment
  • The maturity of analytical database servers
  • Star schema, snowflake and data vault models
  • Selling business intelligence to the business
  • Big data analytics
  • The relationship between master data management and data warehousing
  • Guidelines for using ETL tools
  • Data virtualization and data federation
  • The BI mashup
  • The need for Master Data Management in a data warehouse environment
  • BAM (Business Activity Monitoring) and KPI (Key Performance Indicators)
  • New database technology for implementing data warehouses, such as
  • Business intelligence as ROI for the data warehouse
  • Who needs real-time data warehouses?
  • Business Optimization through BPEL, BAM and SOA
  • BI scorecarding
  • Customer analytics and insight
  • Text mining and text analytics
  • Open source BI
  • Corporate Performance Management

I look forward to your proposal and hope to see you in London this coming November.

Rick F. van der Lans

Chairman of the Data Warehouse & Business Intelligence European Conference 2011


Posted March 24, 2011 6:06 PM

This week I had a meeting with Oco, a vendor of analytic SAAS BI Applications based in Waltham, MA. I had never heard of them, but the first versions of their current offerings have been available since 2007 and the company has been around since 1999.

 

Currently, there are many vendors on the market that deliver SAAS BI capabilities. However, most of them deliver tools and technologies. The customer's BI applications have to be designed and built first. Not so with Oco. Oco's claim to fame is a set of data models for different application areas, such as buyer performance, supplier evaluation, capacity utilization, revenue trending, and account visibility. This means that the product comes with pre-defined data structures (that customers can adapt to their own needs) plus pre-built BI applications operating on those data models. The attractive part of this is that once the data structures of the customer's production data have been mapped to Oco's data models, the hard part has been done, and the business data can be analyzed very quickly.

 

Being a SAAS vendor, they host all the software. Data can be uploaded periodically to refresh the database. They have their own products for copying and transforming the data. The database server and data integration technology are all developed with Microsoft software. The front end is partly based on SAP/BO software, such as Xcelsius, WebIntelligence, and Explorer. In addition, some of the front ends, such as the ones for KPI dashboards and multi-dimensional reporting, are developed with their own tool. It's all Flex based.

 

Oco is an interesting company with an interesting offering. Although they are very much focused on SAP, their market is still very US based. Maybe that's the reason why I had never heard of them?


Posted January 27, 2011 12:10 AM