


Rick van der Lans

Welcome to my blog where I will talk about a variety of topics related to data warehousing, business intelligence, application integration, and database technology. Currently my special interests include data virtualization, NoSQL technology, and service-oriented architectures. If there are any topics you'd like me to address, send them to me at rick@r20.nl.

About the author

Rick is an independent consultant, speaker and author, specializing in data warehousing, business intelligence, database technology and data virtualization. He is managing director and founder of R20/Consultancy. An internationally acclaimed speaker who has lectured worldwide for the last 25 years, he is the chairman of the successful annual European Enterprise Data and Business Intelligence Conference held in London. In the summer of 2012 he published his new book Data Virtualization for Business Intelligence Systems. He is also the author of one of the most successful books on SQL, the popular Introduction to SQL, which is available in English, Chinese, Dutch, Italian and German. He has written many white papers for various software vendors. Rick can be contacted by sending an email to rick@r20.nl.

Editor's Note: Rick's blog and more articles can be accessed through his BeyeNETWORK Expert Channel.

Quite recently, during a trip to Silicon Valley in the San Francisco Bay Area, I visited some of the well-known NoSQL vendors. What became obvious after a couple of meetings is that all of them have added or are adding SQL interfaces (or some form of SQL) to their products. For example, Cloudera will release Impala, MapR is working on Drill, and DataStax has CQL. These are all SQL interfaces. But the list goes on: data virtualization vendors such as Composite, Denodo, and Informatica support access to NoSQL products; Hadapt offers SQL on top of Hadoop; Simba and Quest have released SQL support for various NoSQL products; and MarkLogic, a NoSQL vendor that has developed a search-based transactional database server, will release a SQL interface. In other words, the SQL-fication of NoSQL has started and continues at a rapid pace.

Adding SQL is a wise decision, because through SQL the (big) data stored in these systems becomes available to a much larger audience and therefore becomes more valuable to the business. It makes it possible to use a much broader set of products to query and analyze that data. Evidently, not all of these SQL implementations are perfect today, but I don't doubt that they will improve over time.
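To make this concrete, here is a minimal sketch of the kind of query such a SQL interface makes possible (the table and column names are hypothetical). Any reporting or analysis tool that speaks SQL can send a statement like this to, for example, Impala or Hive and analyze data stored in Hadoop without knowing anything about MapReduce or file formats:

    -- Hypothetical weblog data stored in Hadoop, exposed as an ordinary SQL table
    SELECT   region, COUNT(*) AS number_of_visits
    FROM     weblog_visits
    WHERE    visit_date >= '2013-01-01'
    GROUP BY region
    ORDER BY number_of_visits DESC;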

Considering this SQL-fication that's going on, how much longer can we state that the term NoSQL stands for NO SQL? Maybe in a few years we will say that NoSQL stands for Not Originally SQL.

In a way, this transformation reminds me of the history of Ingres. This database server started out as a NoSQL product as well. In the beginning, Ingres supported a database language called Quel (a relational language, but not SQL). Eventually, the market forced them to convert to SQL. Not Originally SQL certainly applies to them.

Anyway, the SQL-fication of NoSQL products and big data has started and continues. To me, this is a great development, because more and more organizations understand what a major asset their data is. Therefore, data, any data, big or small, should be stored in systems that can be accessed by as many tools, applications, and users as possible, and that's what SQL offers. Such a valuable asset should not be hidden and buried deep in systems that can only be accessed by experts and technical wizards.


Posted February 8, 2013 8:53 AM
Permalink | 2 Comments |
The 1880 US census was taking a long time, much too long for the decision makers of the United States Census Office (USCO). The reason it took so long was that it was an almost entirely manual process. Eventually, it took eight years to complete the exercise.

Halfway through the process, in 1884, it was evident that completion was still far off. Therefore, one of the employees of the USCO was asked to design a machine that would speed up the process for the upcoming 1890 census. This machine had to make it possible to process the enormous amount of data much faster.

That employee was Herman Hollerith. In fact, two others, William Hunt and Charles Pidgin, were asked to do the same. A benchmark was prepared in which all three could demonstrate how fast their solutions were. Coding took 144 hours with Hunt's method, 100 hours with Pidgin's method, and 72 hours with Hollerith's. Processing the data took 55, 44, and 5 hours respectively. The conclusion: Hollerith's solution was by far the fastest and was therefore selected by the USCO.

For the 1890 census, 50,000 men were used to gather the data and to put it on punch cards. It was decided to store many more data attributes: 235 instead of the 6 used in the 1880 census. Hollerith also invented a machine for punching cards, which made it possible for one person to produce 700 punch cards per day. Because of Hollerith's machines, 6,000,000 persons could be counted per day. His machines reduced a ten-year job to a few months. In total, his inventions led to $5 million in savings.

Hollerith's ideas for automation of the census are described in Patent No. 395,782 of Jan. 8, 1889 which starts with the following sentence: "The herein described method of compiling statistics ..."

Does this all sound familiar? Massive amounts of data, compiling statistics, the need for better performance. To me it sounds as if Hollerith was working on the first generation of big data systems.

Hollerith started his own company, the Tabulating Machine Company, in 1896. In 1911 it merged with a few other companies to form the Computing-Tabulating-Recording Company (CTR), and in 1924 the name CTR was changed to IBM. In other words, IBM has always been in the business of big data, analytics, and appliances.

Why did it take so long before we came up with the term big data while, evidently, we have been developing big data systems since the very beginning of computing? You could say that the first information processing system was a big data system using analytics. This means that Hollerith, besides being a very successful inventor, can be considered the grandfather of big data.


Posted February 1, 2013 7:00 AM
Permalink | No Comments |
In this series of blogs I'm answering common questions on data virtualization. In this blog I address the question: "Do data virtualization products support data security?" In most cases what is meant is whether these products can hide some of the data that is being made available through the data virtualization server.

The issue is the following. When a data virtualization server has been connected to many different data sources and when a user has access to that data virtualization server, potentially he has access to a vast amount of data. Should he be allowed to access all that data, or should certain pieces be hidden? For most organizations the answer is that users are not always allowed to access or change all that data.

All data virtualization servers support a form of data security we usually call authorization. Authorization rules can be defined to control which user is allowed to access which data elements. This is somewhat similar to assigning privileges to users with the GRANT statement in SQL. The following types of privileges are normally supported by data virtualization servers: read, write, execute, select, update, insert, and grant.

Privileges can be granted on the table level, the column level, the row level, and the individual value level. Table-level and column-level privileges are supported by all data virtualization servers. If a user receives a table-level privilege, he can see or access all the data in that table. When the user receives a privilege on only a subset of the columns, the other columns stay hidden.
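As a rough analogy (the exact syntax and mechanism differ per data virtualization server), table-level and column-level rules resemble standard SQL GRANT statements. The table and role names below are hypothetical:

    -- Table-level privilege: the analyst role may query the entire virtual table
    GRANT SELECT ON customers TO analyst_role;

    -- Column-level privilege: the support role sees only a subset of the columns
    GRANT SELECT (customer_id, name, city) ON customers TO support_role;

    -- Grant privilege: this user may pass the privilege on to other users
    GRANT SELECT ON customers TO dv_administrator WITH GRANT OPTION;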

In some situations, authorization rules have to be defined on a more granular level, namely on individual rows. Imagine that two users may query the same virtual table, but they are not allowed to see the same set of rows. For example, a manager may be allowed to see the data of all the customers, whereas an account manager may only see the customers for whom he is responsible. The effect of row-level privileges is that when two users retrieve data from the same virtual table, they see different sets of rows.
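One way to picture such a row-level rule (each product has its own mechanism for this; the sketch below merely illustrates the idea in standard SQL with hypothetical names) is as a virtual table with a built-in filter on the user who queries it:

    -- Hypothetical rule: each account manager sees only his own customers
    CREATE VIEW my_customers AS
    SELECT customer_id, name, city, account_manager
    FROM   customers
    WHERE  account_manager = CURRENT_USER;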

The most granular form of a privilege is a value-level privilege. This allows for defining privileges on individual values in a row. The effect is that some users have access to particular rows, but they won't see some of the values in those rows, or they only see a part of the values. Defining value-level privileges is sometimes referred to as masking.
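Masking can be pictured in a similar way (again, this is only an illustration in standard SQL with hypothetical names; the actual masking rules are defined in the data virtualization server): the row is returned, but a sensitive value is shown only partially:

    -- Hypothetical rule: only the last four digits of the credit card number are visible
    CREATE VIEW masked_customers AS
    SELECT customer_id,
           name,
           'XXXX-XXXX-XXXX-' || SUBSTR(credit_card_number, 13, 4) AS credit_card_number
    FROM   customers;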

To summarize, data virtualization products offer a rich set of data security options. Besides the authorization rules mentioned above, encryption is usually supported for messages that are sent and received, and so on.

For more information on data security and for more general information on data virtualization I refer to my book "Data Virtualization for Business Intelligence Systems."

Note: If you have questions related to data virtualization, send them in. I am more than happy to answer them.


Posted January 25, 2013 1:19 AM
Permalink | No Comments |
In this series of blogs I'm answering popular questions on data virtualization. In this blog I address the question: "Do data virtualization products support updates and transactions?" Sometimes this question is phrased as "When I use data virtualization, can I only query the data sources, or can I also change the data in the underlying data sources?"

In most situations, data virtualization is deployed for accessing data stored in data warehouses, staging areas, and data marts. Data is read from those data sources by the data virtualization servers, but it is not updated, inserted, or deleted. And because, currently, data virtualization servers are predominantly used in BI and reporting environments, some people may get the impression that these products cannot, or are not allowed to, update, insert, or delete the data. This is not true.

Some of the data virtualization products were initially designed to be deployed in SOA environments, where they could be used to simplify the development of transactions on databases and applications. So, although the focus of some vendors has shifted to BI environments, the ability to run transactions still exists. Most products allow data in data sources (even non-SQL-based sources) to be changed. The data virtualization server handles all the transactional aspects of those changes. They even support distributed transactions: when data in multiple data sources is changed, those changes are treated as one atomic transaction. And they support heterogeneous distributed transactions, in which data in different types of data sources is changed within one transaction.
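Conceptually, such a heterogeneous distributed transaction looks to the application like one ordinary transaction on two virtual tables. The table names below are hypothetical, and the coordination (for example via two-phase commit) is handled internally by the data virtualization server:

    -- customer_master is backed by, say, an Oracle database and customer_profile by a NoSQL store;
    -- the data virtualization server makes sure that both changes succeed or both are rolled back
    START TRANSACTION;
    UPDATE customer_master  SET address = '12 Main Street' WHERE customer_id = 4711;
    UPDATE customer_profile SET address = '12 Main Street' WHERE customer_id = 4711;
    COMMIT;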

Evidently, data can only be changed and distributed transactions can only be executed on data sources that allow and support such functionality. For example, data can't always be changed if the underlying data source is a web service, a Word document, or an external data source. In such situations, data cannot be changed or you're just not allowed to change it.

To summarize, data virtualization servers allow data in the data sources to be changed, and they can guarantee the correct handling of transactions. This makes them suitable for, for example, creating a modular set of services that can be used by applications to change the data. These services hide where data is stored, whether the data has to be changed in multiple systems, how the data is stored, and so on. Note, though, that support for updates and transactions differs between data virtualization servers.

For more information on updates and transactions and for more general information on data virtualization I refer to my book "Data Virtualization for Business Intelligence Systems."

Note: If you have questions related to data virtualization, send them in. I am more than happy to answer them.


Posted January 23, 2013 7:30 AM
Permalink | No Comments |
Many people still associate the company Quest Software with Java and C# program development. Products such as Toad for Oracle and Toad for SQL Server have been and still are very popular among programmers. What many do not know is that Quest, now owned by Dell, has a very powerful set of business intelligence tools, including Toad Data Point and Toad Decision Point.

Toad Data Point is a cross-platform query and reporting tool designed for data analysts. It allows for visual query building and workflow automation. In addition, and this is a strong point of Quest's BI portfolio, it has extensive data connectivity capabilities. Besides being able to access many classic SQL databases through standard ODBC/JDBC drivers, it comes with a client version of Quest Data Services. Quest Data Services can be seen as a modern data federation product. It simplifies access to data sources and allows users to join data from multiple data sources, including data from Hadoop via Hive.

Toad Decision Point is a cross-platform reporting, analytical, and visualization tool designed for decision makers. It's a highly graphical environment. It also comes with a client version of Quest Data Services. However, with this version of Quest Data Services a wider range of data sources can be accessed, including applications such as Salesforce.com, and NoSQL data sources such as Hadoop, MongoDB, and Cassandra. Some of these NoSQL databases support non-relational concepts, such as super-columns. Quest Data Services makes it possible to access those non-relational concepts in a relational style. This opens up all the valuable data stored in these systems to a large set of tools, reports, and users.

The third core product of Quest's BI stack is Toad Intelligence Central, which is a server version of Quest Data Services (the one supported by Toad Decision Point). Because it runs on a server, it makes data source access much more efficient, and users and reports can share specifications. The IT department can enter and manage these centralized specifications, making it easier for users to develop and maintain their reports.

In short, the Quest BI portfolio is definitely worth studying. Especially now that it has the backing of Dell itself, the product set could become a serious challenger to the established BI vendors.


Posted January 11, 2013 8:28 AM
Permalink | 2 Comments |