Blog: Rick van der Lans

Welcome to my blog where I will talk about a variety of topics related to data warehousing, business intelligence, application integration, and database technology. Currently my special interests include data virtualization, NoSQL technology, and service-oriented architectures. If there are any topics you'd like me to address, send them to me.

Drowning in Data Lakes and Data Reservoirs
Besides the data warehouse and the data mart, we now also have the data lake, the data reservoir, and the data hub. These new architectures have all been introduced recently and have a direct relationship with the ever more popular Hadoop technology. But don't the fundamentals of Hadoop clash with those of these architectures? Is the technology not inconsistent with the architectures?

Forbes magazine defined the data lake by comparing it to a data warehouse: "The difference between a data lake and a data warehouse is that the data in a data warehouse is pre-categorized at the point of entry, which can dictate how it's going to be analyzed." Data hub is a term used by various big data vendors. In a recent tweet, Clarke Patterson summarized it as follows: data lake = storage + integration; and data hub = data lake + exploration + discovery + analytics. A data reservoir is a well-managed data lake.

All three architectures have in common that data from many sources is copied to one centralized data storage environment. After it has been stored in the data lake (or hub or reservoir), it's available for all business purposes, such as integration and analysis. The preferred technology for developing all three is Hadoop. Hadoop is selected, firstly, because its low price/performance ratio makes it attractive for developing large-scale environments, and secondly, because it allows raw data to be stored in its original form, and not in a form that already limits how we can use the data later on.

In principle, there is nothing wrong with the idea of making all data available in a way that leaves open any form of use, but is moving all the data to one physical data storage environment implemented with Hadoop the smartest alternative?

We have to remember that Hadoop's primary strength, from which it derives its processing scalability, is that it pushes processing to the data, and not vice versa, as most other platforms do. Because it's not pushing the data to the processing, Hadoop is able to offer high levels of storage and processing scalability.
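This design principle can be illustrated with a small, conceptual Python sketch (not Hadoop's actual API): a function is shipped to each node that holds a partition of the data, and only the small per-node results travel across the network.

```python
# Conceptual sketch (not Hadoop's actual API): ship the computation
# to the nodes that hold the data, and move only the small results.

partitions = {                       # data as it lives on three nodes
    "node1": [3, 1, 4, 1, 5],
    "node2": [9, 2, 6, 5],
    "node3": [3, 5, 8, 9, 7],
}

def push_processing(func, cluster):
    """Run func locally on each node's partition; only the
    per-node results (not the raw data) are moved."""
    return [func(data) for data in cluster.values()]

# E.g. a global sum: each node sums locally, then we combine.
local_sums = push_processing(sum, partitions)
total = sum(local_sums)
```

The raw partitions never move; at terabyte scale, shipping a function and collecting a handful of numbers is vastly cheaper than shipping the data itself.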

Now, isn't this inconsistent with the data lake, data hub, and data reservoir architectures? On one hand we have, for example, the data lake in which all the data is pushed to a centralized data storage environment for processing, and on the other hand, we deploy Hadoop technology, which shines when processing is pushed to the data.

Shouldn't we come up with architectures that are more in line with the fundamental design concept of Hadoop itself, namely architectures where processing (integration and analysis) is pushed to the systems where the data originates (where the data is entered and collected)?

What is needed are systems that make applications and users think that all data is centralized and that all of it can be freely analyzed. This also avoids massive and continuous data movement exercises and avoids duplicate storage of data. In other words, we need virtual data lakes, virtual data hubs, and virtual data reservoirs. Such architectures would be much more in line with Hadoop, which is based on pushing processing to the data. Maybe we should open this discussion first before we recommend that organizations invest heavily in massive data lakes in which they may drown.

Mon, 19 May 2014 00:48:46 -0700
VisualCue: An Alternative Way of Working with BI Dashboards
I thought that we had seen it all with respect to dashboards and that nothing new could be invented. Wrong! Newcomer VisualCue has surprised me completely. This tool supports a new way of presenting KPIs with dashboards, and a new way of working with them.

Besides the more traditional visual components, the key visual component in VisualCue is a tile. In a tile various icons can be used to show multiple KPIs of a process or object. Each icon represents a KPI. A tile can represent, for example, a call center agent, a truck driver, a flight, a hotel, or a project. Heavy use is made of colors to indicate the state of the KPI, and thus the object. The icons can be designed by the customer or they can be selected from a long predefined list.
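How a tile could derive its colors can be sketched as follows; the KPI names and thresholds below are invented for illustration and are not VisualCue's implementation.

```python
# Hypothetical sketch of how a tile might map KPI values to
# traffic-light colors; KPI names and thresholds are invented.

THRESHOLDS = {
    "calls_handled": (50, 80),      # higher is better: red below 50, green at 80+
    "avg_wait_seconds": (120, 60),  # lower is better: red above 120, green at 60 or less
}

def kpi_color(kpi, value):
    low, high = THRESHOLDS[kpi]
    if low < high:                  # higher is better
        return "green" if value >= high else "amber" if value >= low else "red"
    else:                           # lower is better
        return "green" if value <= high else "amber" if value <= low else "red"

# One tile = one call-center agent; each icon gets one color.
tile = {kpi: kpi_color(kpi, v)
        for kpi, v in {"calls_handled": 85, "avg_wait_seconds": 90}.items()}
```

A screen full of such tiles then shows, in one glance, which agents are all green and which ones need attention.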

By showing many tiles next to each other, business users can see the states of numerous objects or processes at a glance. In other words, the tiles make it possible for business users to see and process a lot of detailed data at a single glance. There is no need to aggregate data, which always hides data, and hiding data can lead to missing opportunities or possible problems. So, with tiles, users can see both the wood and the trees.

The tiles are dynamic. They can be classified and grouped based on characteristics; for example, call center operators can be grouped based on business goals reached. Business users can also drill down on the tiles, showing even more details. And that's where this product really shines.

Again, I thought I had seen everything with respect to dashboards, but clearly I was wrong. The product is still very young, so it still has to prove itself in large, real-life projects. But I recommend checking out VisualCue anyway.

Tue, 13 May 2014 02:53:52 -0700
Predicting the Future of IT is Very Difficult
Some of the authors were dead right with their predictions:

  • Avi Silberschatz: Accessing continuously growing databases with zero latency requires new solutions, such as main-memory database systems.
  • Richard Schwartz: The future is for the 10-second application; the challenge is to develop applications requiring an absolute minimal level of keystrokes ...
  • Kent Beck: In the next 10 years ... we jump to a world where we see software development as a conversation between business and technology.

And some were less lucky (I will leave it to the readers to determine which ones were correct):

  • Eric Bonabeau: Swarm intelligence will become a new model for computing.
  • Mitchell Kertzman: The future of software is to be platform-independent, not just OS-independent, but device-independent, network-transport independent ...
  • Thomas Vayda: By 2003, 70% of all new applications will be built primarily from "building blocks" ...

Interestingly enough, the big changes in the IT industry of the last few years were not mentioned at all, such as:

  • Big data
  • Smartphones
  • Social media networks

This shows how hard it really is to predict the future of the IT industry. Maybe it's better to listen to Sir Winston Churchill: "I always avoid prophesying beforehand because it is much better to prophesy after the event has already taken place."

Sun, 02 Mar 2014 11:29:53 -0700
Convergence of Data Virtualization and SQL-on-Hadoop Engines
What was clear at the Strata Conference this month was that the popularity of Hadoop is unmistakable and that SQL-on-Hadoop follows closely in its footsteps. A SQL-on-Hadoop engine makes it possible to access big data, stored in Hadoop HDFS or HBase, using the language so familiar to many developers, namely SQL. SQL-on-Hadoop also makes it easier for popular reporting and analytical tools to access big data in Hadoop.

Tools that have been offering SQL access to non-SQL data sources for a long time are the data virtualization servers. Most of them allow SQL access to data stored in spreadsheets, XML documents, sequential files, pre-relational database servers, data hidden behind APIs such as SOAP and REST, and data stored in applications such as SAP.

Most of the current SQL-on-Hadoop engines offer only SQL query access to one or two data sources: HDFS and HBase. That sounds easy, but it's not. The technical problem they have to solve is how to turn all the non-relational data stored in Hadoop, such as variable data, self-describing data, and schema-less data, into flat relational structures.

However, the question is whether offering query capabilities on Hadoop is sufficient, because the bar is being raised for SQL-on-Hadoop engines. Some, such as SpliceMachine, offer transactional support on Hadoop in addition to the queries. Others, such as Cirro and ScleraDB, support data federation: data stored in SQL databases can be joined with Hadoop data. So, maybe offering SQL query capabilities on Hadoop will not be enough anymore in the near future.
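What such data federation amounts to can be sketched conceptually; the data and the join function below are invented for illustration and do not reflect any vendor's API.

```python
# Conceptual sketch of data federation (invented data, no vendor API):
# join rows coming from a SQL database with rows coming from Hadoop.

sql_rows = [                      # e.g. a customer table in a SQL database
    {"cust_id": 1, "name": "ACME"},
    {"cust_id": 2, "name": "Globex"},
]
hadoop_rows = [                   # e.g. aggregated clickstream data in HDFS
    {"cust_id": 1, "clicks": 420},
    {"cust_id": 2, "clicks": 13},
]

def federated_join(left, right, key):
    """Hash join performed inside the federation engine, after each
    source has returned its own rows."""
    index = {row[key]: row for row in left}
    return [{**index[r[key]], **r} for r in right if r[key] in index]

joined = federated_join(sql_rows, hadoop_rows, "cust_id")
```

The point of federation is that the engine, not the user, decides where each half of the join is evaluated and how the results are combined.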

Data virtualization servers have started to offer access to Hadoop as well, and with that they have entered the market of SQL-on-Hadoop engines. When they do, they will raise the bar for SQL-on-Hadoop engines even more. Current data virtualization servers are not simply runtime engines that offer SQL access to various data sources. Most of them also offer data federation capabilities for many non-SQL data sources, a high-level design and modeling environment with lineage and impact analysis features, caching capabilities to minimize access of the data source, distributed join optimization techniques, and data security features.

In the near future, SQL-on-Hadoop engines are expected to be extended with these typical data virtualization features. And data virtualization servers will have to enrich themselves with full-blown support for Hadoop. But whatever happens, the two markets will slowly converge into one. Products will merge together and others will be extended. This is definitely a market to keep an eye on in the coming years.

Mon, 24 Feb 2014 03:39:30 -0700
The Battle of the SQL-on-Hadoop Engines
At the Strata Conference in Santa Clara, CA, it was very clear: the battle of the SQL-on-Hadoop engines is underway. Many existing and new vendors presented their solutions at the exhibition, and many sessions were dedicated to this topic.

As popular as NoSQL was a year ago, so popular is SQL-on-Hadoop today. Here are some of the many implementations: Apache Hive, CitusDB, Cloudera Impala, Concurrency Lingual, Hadapt, InfiniDB, JethroData, MammothDB, MapR Drill, MemSQL, Pivotal HawQ, Progress DataDirect, ScleraDB, Simba, and SpliceMachine.

Besides these implementations, we should also include all the data virtualization products that are designed to access all kinds of data sources including Hadoop and to integrate data from different data sources. Examples are Cirro, Cisco/Composite, Denodo, Informatica IDS, RedHat JBoss Data Virtualization, and Stonebond.

And, of course, we have a few SQL database servers that support polyglot persistence. This means that they can store their data in their own native SQL database or in Hadoop. Examples are EMC/Greenplum UAP, HP Vertica (on MapR), Microsoft PolyBase, Actian ParAccel, and Teradata Aster database (SQL-H).

Most of these implementations are currently restricted to querying the data stored in Hadoop, but some, such as SpliceMachine, support transactions on Hadoop. Most of them don't work with indexes, although JethroData does.

This attention for SQL-on-Hadoop makes a lot of sense. Making all the big data stored in HDFS available through a SQL interface makes it accessible to numerous reporting and analytical tools. It makes big data available to the masses. It's no longer only for the happy few who are good at programming Java.

If you're interested in SQL-on-Hadoop, you have to study at least two technical aspects. First, how efficient are these engines when executing joins? Especially joining multiple big tables is a technological challenge. Second, running one query fast is relatively easy, but how well do these engines manage their workload if multiple queries with different characteristics have to be executed concurrently? In other words, how well does the engine manage the query workload? Can one resource-hungry query consume all the available resources, making all the other queries wait? So, don't be influenced too much by single-user benchmarks.

It's easy to predict that we will see many more of these SQL-on-Hadoop implementations entering this market. That the existing products will improve and become faster is evident. The big question is which of them will survive this battle? That not all of them will be commercially successful is evident, but for customers it's important that a selected product still exists after a few years. This is hard to predict today, because the market is still rapidly evolving. Let's see what the status is of this large group of products next year at Strata.

Thu, 20 Feb 2014 01:46:24 -0700
An Overlooked Difference Between SQL-on-Hadoop Engines and Classic SQL Database Servers
Recently, many SQL-on-Hadoop query engines were released, including Drill, Hawq, IBM BigSQL, Impala, and Facebook Presto. This new generation of SQL-on-Hadoop query engines can be compared with classic SQL database servers in many ways. For example, one can study the richness of the SQL dialect, the transaction mechanism, the quality of the query optimizer, or the query performance. The latter, especially, is done quite regularly.

In the long run, what's more interesting is the independence between the SQL-on-Hadoop query engines on the one hand, and the file systems and file storage formats on the other. This is a difference overlooked by many.

Almost all classic SQL database servers come with their own file systems and file formats. That means that we can't develop a database file using, for example, Oracle, and then later on access that same file using DB2. To be able to do that, the data has to be exported from the Oracle database and then imported into the DB2 database. In these systems, the SQL query engine plus the file system plus the file format form one indivisible unit. In a way, it's a proprietary stack of software for data storage, data manipulation, and data access.

One of the big advantages of most of the new SQL-on-Hadoop query engines is the independence between the three layers. Query engines, file systems and file formats are interchangeable. The consequence is that data can be inserted using, for example, Impala, and accessed afterwards using Drill. Or, a SQL query engine can seamlessly switch from the Apache HDFS file system to the MapR file system or to the Amazon S3 file system.
The big advantage is that this allows us to deploy different query engines, each with its own strengths and weaknesses, on the same data. There is no need to copy and duplicate the data. For example, for one group of users a SQL engine can be selected that is designed to support high-end, complex analytical queries, and for the other group an engine that's optimized for more simple interactive reporting.
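The interchangeability can be illustrated with a deliberately simple sketch: two invented "engines" read the same stored file format, so neither needs its own copy of the data. (CSV stands in here for a real Hadoop file format such as Parquet or ORC.)

```python
# Conceptual sketch of the layer independence: two different query
# "engines" operate on the same stored file, so no data has to be
# copied. Engine names are invented; CSV stands in for a real format.

import csv
import io

shared_file = io.StringIO("product,qty\nbolt,100\nnut,250\n")

def engine_a_scan(f):              # e.g. an engine tuned for complex analytics
    f.seek(0)
    return sum(int(row["qty"]) for row in csv.DictReader(f))

def engine_b_scan(f):              # e.g. an engine tuned for simple reporting
    f.seek(0)
    return [row["product"] for row in csv.DictReader(f)]

total_qty = engine_a_scan(shared_file)   # both engines read the same bytes
products = engine_b_scan(shared_file)
```

Both engines see the same data the moment it's written; nothing is exported, imported, or duplicated.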

It's comparable to having one flat screen that can be used as TV, computer screen, and as projector screen, instead of having three separate screens.

This independence between the layers is a crucial difference between SQL-on-Hadoop query engines and classic SQL database servers. However, it's not getting the attention it deserves yet. The advantages of it are quite often overlooked. It's a bigger advantage than most people think, especially in the world of big data, where duplicating complete files can be very time-consuming and expensive.

Mon, 27 Jan 2014 01:18:46 -0700
Data and Reality
I bought the book in 1981. The edition I have is old. It uses a non-proportional font and you can clearly see that typesetting was done with Word, which at that time was the ultimate word processing software tool.

I still consider this book to be the best book on data modeling. Many other great books have been published, but this is still my favorite. One of the reasons I like it is because it addresses very fundamental data modeling questions, such as "Is an object still the same object when all its parts have been replaced?" and "What is really the difference between attributes and relationships?"

Maybe you have never heard of this book or William Kent, but you may possibly know his simple way to remember normalization: "[Every] non-key [attribute] must provide a fact about the key, the whole key, and nothing but the key."

I have always recommended this book to the attendees of my data modeling sessions, even though I knew the book was hard to get. The good news is that a new edition has been published. It includes comments by Steve Hoberman and a note by Chris Date. If you haven't read this book yet, this is your chance. A must-read if you're into data modeling.

Wed, 08 Jan 2014 01:21:14 -0700
Exploring and Investigating Data with CXAIR
A data scientist therefore needs tools that allow him to freely explore and investigate data. Incorporating more data sets in the analysis should be very easy, there should be no need to specify a goal beforehand, and it must be possible to analyze data in an unguided way.
Besides the more standard features, such as displaying data as bar charts, in dashboards, and on geographical maps, the perfect tool for this type of work should support at least the following characteristics:

  • No advance preparations: There should be no need to predefine data structures for the analysis work in advance. If data is available, it should be possible to load it without any preparations, even if it concerns a new type of data.
  • Unguided analysis: Analysts should be able to invoke the analysis technology without having to specify a goal in advance. The exploration technology should allow for analyzing data in an unguided style.
  • Self-service: Analysts should be able to use the analysis techniques without help from IT experts.
Connexica's analysis tool, called CXAIR, is such a tool. Its natural language interface, Venn-diagramming techniques, and visualization features allow users to freely query and analyze data. No cubes or star schemas have to be defined beforehand (which would limit the analysis capabilities).

CXAIR internally organizes all the data using an intelligent index. In fact, internally it's based on text-search technology. This makes it possible to combine and relate data without any form of restriction, which is what analysts need.
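The principle of such a search-style index can be sketched as a simple inverted index; this is a conceptual illustration of the technique, not CXAIR's internals.

```python
# Conceptual sketch (not CXAIR's internals): an inverted index lets
# records be related through shared values, without predefined
# cubes or star schemas.

from collections import defaultdict

records = [
    {"id": "r1", "city": "London", "product": "bolt"},
    {"id": "r2", "city": "Berlin", "product": "bolt"},
    {"id": "r3", "city": "London", "product": "nut"},
]

# Map every value to the set of records containing it.
index = defaultdict(set)
for rec in records:
    for field, value in rec.items():
        if field != "id":
            index[value].add(rec["id"])

# Any combination of values can now be intersected freely,
# which is exactly the unrestricted combining analysts need.
london_and_bolt = index["London"] & index["bolt"]
```

Because every value is indexed up front, a new question is just another set intersection; no schema work is needed first.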

Unlike most analysis tools, CXAIR uses search technology that speeds up data analysis. For calculations, a mixture of in-memory and on-disk caching is used to analyze massive amounts of data at search-engine speeds. All the loaded data resides on the server, because CXAIR provides a thin web client interface; in other words, no data is loaded on the client machine. Numbers are cached on the server, but text is not. Because CXAIR doesn't cache all the data, available memory is not a restriction. The cache is used to improve performance, so a large internal memory is not a necessity, but it will help performance, particularly for ad-hoc calculations.

CXAIR is clearly a representative of a new generation of reporting/analysis tools that users can deploy to freely analyze data. It's a tool for self-service discovery and investigation of data. It's the tool that many data scientists have been waiting for, and it's worth checking out.

Tue, 10 Dec 2013 06:52:22 -0700
NoSQL: A Challenge for Data Replication
Many of the NoSQL systems have built-in data replication features: data is automatically stored multiple times. In fact, developers can set how many replicas have to be created. However, the replication features of NoSQL systems are limited to a homogeneous environment. It's not possible to use these features when, for example, data has to be replicated from a NoSQL system to a classic SQL system.

Today, most data replication products can't replicate from or to NoSQL systems. However, if they can in the future, what will be important is that they handle the non-relational concepts of NoSQL systems efficiently. The keyword here is efficiently. Most existing data replication tools have been designed and optimized to copy data between SQL systems. So, they have been optimized to efficiently process relatively short records with a fixed structure. However, NoSQL records are not always short and fixed with respect to structure. NoSQL systems support a wide range of concepts:

  • Many NoSQL systems, including the key-value stores, the document stores, and the column-family stores, support extremely long records. These records can be orders of magnitude longer than what is common in SQL systems. Current data replicators have been optimized to replicate short records.
  • Almost all NoSQL systems support tables in which each record can have a different structure. This is new for data replication products. For example, what will that do to compression algorithms that assume that all records have the same structure?
  • Document stores and column family stores support hierarchical structures. If that type of data has to be replicated into SQL systems, it has to be flattened somehow. The challenge is to do that very fast. But can it be done fast enough? Data replicators are usually not strong at transformations, because it slows down the replication process too much.
  • Column-family stores support what the relational world used to call repeating groups. As with hierarchical structures, the question is how the data replication tools can map them to relational structures efficiently.
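The flattening challenge from the last two bullets can be sketched as follows; the document structure and the helper function are invented for illustration, and a real replicator would have to do this at very high speed.

```python
# Invented sketch of the flattening a replicator must perform when a
# hierarchical document-store record is replicated to a SQL system:
# one nested document becomes one parent row plus N child rows.

def flatten_order(doc):
    """Split a nested order document into one 'orders' row and
    one 'order_lines' row per repeating line item."""
    order_row = {k: v for k, v in doc.items() if k != "lines"}
    line_rows = [
        {"order_id": doc["order_id"], **line}   # carry the foreign key
        for line in doc.get("lines", [])
    ]
    return order_row, line_rows

doc = {
    "order_id": 1,
    "customer": "ACME",
    "lines": [
        {"product": "bolt", "qty": 100},
        {"product": "nut", "qty": 250},
    ],
}

order_row, line_rows = flatten_order(doc)
```

Doing this per record is trivial; doing it for millions of changes per hour, without slowing the replication stream down, is the real engineering problem the post describes.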
There is no question about whether we need data replication technology to replicate between NoSQL and SQL systems. But the key question is whether it can do this efficiently. This is more than adding one extra source to their list of supported products. It requires a substantial redesign of the internals of these products. This is the challenge these vendors are confronted with in the coming years. Hopefully, they will not claim to support NoSQL, while in fact they only replicate data from NoSQL systems if that data has a relational form.

Note: For more on this topic, see the whitepaper Empowering Operational Business Intelligence with Data Replication.

Mon, 29 Apr 2013 01:16:01 -0700
Call for Speakers: BI & DW Conference London
Are you interested in speaking at the Data Warehouse & Business Intelligence European Conference in London this coming November? If you are, please fill in the call for speakers.

Previous editions were very successful and attracted more than 200 delegates. Evaluations showed that the attendees were very pleased with the selected speakers, the topics, and setup of the conference.

The 2013 edition is aimed at all aspects of data warehousing and business intelligence, including: trends, design guidelines, product overviews and comparisons, best practices, and new evolving technologies. And like the previous years, the conference is organized together with the highly successful European Data Management and Information Quality Conference.

With this year's call for presentations we are trying to attract proposals for sessions on traditional and future data warehousing and business intelligence aspects. Delegates have expressed a preference for the use of case studies rather than theoretical or abstract topics. We would particularly like practitioners in the field to respond to this call for papers. We encourage new speakers to apply. Success stories - case studies where data warehousing and business intelligence have produced real bottom-line benefits are very much appreciated.

Example topics for proposals are:

  • Agile BI
  • Big data analytics
  • BI in the cloud
  • Data modelling for data warehouses
  • NoSQL in a data warehouse environment
  • The logical data warehouse
  • Data virtualization and data federation
  • The maturity of analytical database servers
  • Star schema, snowflake and data vault models
  • Selling business intelligence to the business
  • The relationship between master data management and data warehousing
  • Guidelines for using ETL tools
  • Operational BI and real-time analytics
  • BAM (Business Activity Monitoring) and KPI (Key Performance Indicators)
  • BI scorecards
  • Customer analytics and insight
  • Text mining and text analytics
  • Open source BI
  • Corporate Performance Management
I am looking forward to your proposal, and hope to see you in London this coming November.

Rick F. van der Lans
Chairman of the Data Warehouse & Business Intelligence European Conference 2013

Mon, 15 Apr 2013 08:38:49 -0700
Ted Codd and Twelve Rules for Relational Databases

About two months ago, Pervasive Software asked me to write a whitepaper describing how well their popular PSQL database server supports Codd's twelve rules for relational databases.

For those not familiar with these rules: in 1985, E.F. (Ted) Codd, the founder of the relational model, defined a set of twelve rules for determining how well a database product supports the relational model. These rules make it possible to answer the question of whether a particular product is a relational database server. They were urgently needed, because many vendors were labeling their products as relational while they were not. So, the term relational became somewhat polluted, and Codd wanted to fix and prevent this.

The study was a real trip down memory lane. It was a pleasure to reread all those articles and books written by Codd himself and those by Chris Date on, for example, updatable views. The work they did then was brilliant. So much of what they wrote is, after so many years, still very true.

After studying Pervasive PSQL in detail, my verdict is that it scores a 10 (on a scale of 0 to 12). Nine rules are fully supported, two partially, and one not. Therefore, the overall conclusion is that PSQL is 83% relationally complete. This is an excellent score and puts PSQL in the list of most relational products.

Is it possible to be 100% relational? The answer is yes. Such products can be developed. In fact, there is one open source product that supports most of the rules: Alphora's DataPhor. However, the product is not (yet) a commercial success. In the same year when Codd introduced the twelve rules, he also wrote "No existing DBMS product that I know of can honestly be claimed to be fully relational, at this time." It looks as if this statement still holds for all the SQL products and probably for most database servers.

Note: Now that Pervasive and Actian have merged, maybe I should write a comparable paper for their Ingres and Vectorwise database server, and see which one is the most relational product.

Mon, 15 Apr 2013 07:40:19 -0700
Data Replication for Enabling Operational Business Intelligence
For more and more users, OBI is crucial. For example, consider operational management and external parties, such as customers, suppliers, and agents. If we give them access to data to support their decision-making processes, in many cases only operational data is relevant.

But how do we develop BI systems that show operational data? In PowerPoint we can draw an architecture in which BI reports directly access operational databases. And on that PowerPoint slide all seems to work fine. Not in real life, however. Running a BI workload on an operational database can lead to interference, performance degradation, performance instability, and so on. In other words, the operational environment is not going to enjoy this.

This is where data replication can come to the rescue. With data replication we can create and keep a replica of an operational database up to date without interfering with the operational processing. When new data is inserted, updated, or deleted in the original operational database, the replica is updated accordingly and almost instantaneously. This replicated database can then be used for operational reporting and analytics.
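Conceptually, such a replication process applies a stream of captured changes to the replica; the sketch below is a minimal illustration, not any product's actual interface.

```python
# Minimal sketch (no product's actual interface): changes captured on
# the operational database are applied to the replica almost
# immediately, so the BI workload never touches the source system.

replica = {}   # the replicated table, keyed on primary key

def apply_change(change):
    """Apply one captured change (insert/update/delete) to the replica."""
    op, key, row = change["op"], change["key"], change.get("row")
    if op in ("insert", "update"):
        replica[key] = row
    elif op == "delete":
        replica.pop(key, None)

change_log = [                       # stream captured from the source
    {"op": "insert", "key": 1, "row": {"status": "open"}},
    {"op": "update", "key": 1, "row": {"status": "shipped"}},
    {"op": "insert", "key": 2, "row": {"status": "open"}},
    {"op": "delete", "key": 2},
]
for change in change_log:
    apply_change(change)
```

All operational reports then query the replica, which lags the source by seconds while the source itself never sees a single BI query.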

Data replication as a technology has been around for a long time, but so far it has been used primarily to increase the availability and/or to distribute the workload of operational systems. My expectation is that data replication will be needed for implementing many new OBI systems. For these products to be ready for BI, besides supporting classic data replication features, such as minimal interference, high throughput, and high availability, they should also support the following three features that are important for BI:

  • Easy to use and easy to maintain: Until now, data replication has been used predominantly in IT departments, and not so much in BI departments or BI Competence Centers. So within these BI groups, only minimal expertise with data replication exists, and little knowledge of how to embed that technology in BI architectures. Because of this unfamiliarity, it's important that these products are easy to install and easy to manage, and that replication specifications can be changed quickly and easily. A Spartan interface is not appreciated.
  • Heterogeneous data replication: In many organizations the database servers used in these operational environments are different from the ones deployed in their BI environments. Therefore, data replication tools should be able to move data between database servers of different brands.
  • Fast loading into analytical database servers: More and more analytical database servers, such as data warehouse appliances and in-memory database servers, are used to develop data warehouses and/or data marts. These database servers are amazingly fast in running queries. What we don't want is for data to be loaded into these products using simple SQL INSERT statements. It will work, but it will be slow. Almost all of these products have specialized interfaces or utilities for fast loading of data. It's vital that data replication products exploit these interfaces or utilities.
To summarize, because of OBI, the need for data replication will increase. It's important that organizations, when they evaluate this technology, study the three features above. For more information on this topic I refer to this whitepaper and this webinar.

Wed, 06 Mar 2013 20:29:16 -0700
The SQL-fication of NoSQL Continues
Adding SQL is a wise decision, because through SQL, (big) data stored in these systems becomes available to a much larger audience and therefore becomes more valuable to the business. It makes it possible to use a much broader set of products to query and analyze that data. Evidently, not all these SQL implementations are perfect today, but I don't doubt that they will improve over time.

Considering this SQL-fication that's going on, how much longer can we state that the term NoSQL stands for NO SQL? Maybe in a few years we will say that NoSQL stands for Not Originally SQL.

In a way, this transformation reminds me of the history of Ingres. This database server started out as a NoSQL product as well. In the beginning, Ingres supported a database language called Quel (a relational language, but not SQL). Eventually, the market forced them to convert to SQL. Not Originally SQL certainly applies to them.

Anyway, the SQL-fication of NoSQL products and big data has started and continues. To me, this is a great development, because more and more organizations understand what a major asset data is. Therefore, data, any data, big or small, should be stored in systems that can be accessed by as many tools, applications, and users as possible, and that's what SQL offers. Such a valuable asset should not be hidden and buried deep in systems that can only be accessed by experts and technical wizards.

Fri, 08 Feb 2013 08:53:24 -0700
Is Herman Hollerith the Grandfather of Big Data?
Halfway through processing the 1880 census, in 1884, it was evident that it would take a long time. Therefore, one of the employees of the USCO was asked to design a machine that would speed up the process for the upcoming 1890 census. This machine had to make it possible to process the enormous amount of data much faster.

That employee was Herman Hollerith. In fact, William Hunt and Charles Pidgin were asked the same question. A benchmark was prepared where all three could demonstrate how fast their solutions were. Coding took 144 hours for Hunt's method, Pidgin's method took 100 hours, and Hollerith's method 72 hours. The processing of the data took respectively 55, 44, and 5 hours. Conclusion, Hollerith's solution was the fastest by far and was, therefore, selected by the USCO.

For the 1890 census, 50,000 men were used to gather the data and to put it on punch cards. It was decided to store many more data attributes: 235 instead of the 6 used in the 1880 census. Hollerith also invented a machine for punching cards, which made it possible for one person to produce 700 punch cards per day. Because of Hollerith's machines, 6,000,000 persons could be counted per day. His machines reduced a ten-year job to a few months. In total, his inventions led to $5 million in savings.

Hollerith's ideas for automation of the census are described in Patent No. 395,782 of Jan. 8, 1889 which starts with the following sentence: "The herein described method of compiling statistics ..."

Does this all sound familiar? Massive amounts of data, compiling statistics, the need for a better performance. To me it sounds as if Hollerith was working on the first generation of big data systems.

Hollerith started his own company, the Tabulating Machine Company, in 1896. In 1911, after a merger with some other companies, it became the Computing-Tabulating-Recording Company (CTR), and in 1924 the name CTR was changed to IBM. In other words, IBM has always been in the business of big data, analytics, and appliances.

Why did it take so long before we came up with the term big data when, evidently, we have been developing big data systems since the early beginnings of computing? You could say that the first information processing system was a big data system using analytics. This means that Hollerith, besides being a very successful inventor, can be considered the grandfather of big data.

Fri, 01 Feb 2013 07:00:45 -0700
Customer Question 11 on Data Virtualization - How Do I Protect Data?
The issue is the following. When a data virtualization server has been connected to many different data sources and a user has access to that data virtualization server, he potentially has access to a vast amount of data. Should he be allowed to access all that data, or should certain pieces be hidden? For most organizations the answer is that users are not always allowed to access or change all of it.

All data virtualization servers support a form of data security we usually call authorization. Authorization rules can be defined to control which user is allowed to access which data elements. This is somewhat similar to assigning privileges to users with the GRANT statement in SQL. The following types of privileges are normally supported by data virtualization servers: read, write, execute, select, update, insert, and grant.
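Conceptually, such authorization rules amount to a catalog that maps users and objects to sets of privileges. The following is a minimal sketch in Python; the in-memory catalog and all user and table names are hypothetical and do not reflect how any particular data virtualization product is implemented:

```python
# Hypothetical privilege catalog, loosely analogous to SQL's
#   GRANT SELECT ON customers TO alice;
PRIVILEGES = {"read", "write", "execute", "select", "update", "insert", "grant"}

# (user, table) -> set of granted privileges
catalog = {}

def grant(user, table, privilege):
    """Record that a user holds a privilege on a table."""
    if privilege not in PRIVILEGES:
        raise ValueError(f"unknown privilege: {privilege}")
    catalog.setdefault((user, table), set()).add(privilege)

def is_allowed(user, table, privilege):
    """Check a privilege before the server executes a request."""
    return privilege in catalog.get((user, table), set())

grant("alice", "customers", "select")
print(is_allowed("alice", "customers", "select"))  # True
print(is_allowed("alice", "customers", "update"))  # False
```

A real product evaluates such rules transparently: the user simply queries a virtual table, and requests that fail the check are rejected.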

Privileges can be granted on the table level, the column level, the row level, and the individual value level. Table-level and column-level privileges are supported by all data virtualization servers. If a user receives a table-level privilege, he can see or access all the data in that table. When the user only receives the privilege on a set of columns, the other columns stay hidden.
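A column-level privilege can be pictured as a projection the server applies on the user's behalf. A hedged Python sketch, with invented user and column names:

```python
# Hypothetical column-level privileges: which columns each user may see.
ALLOWED_COLUMNS = {
    "account_manager": {"customer_id", "name"},  # no access to credit_limit
}

def project(user, rows):
    """Return the rows with only the columns the user is allowed to see."""
    visible = ALLOWED_COLUMNS.get(user, set())
    return [{col: val for col, val in row.items() if col in visible}
            for row in rows]

customers = [{"customer_id": 1, "name": "Jones", "credit_limit": 50000}]
print(project("account_manager", customers))
# [{'customer_id': 1, 'name': 'Jones'}]
```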

In some situations, authorization rules have to be defined on a more granular level, namely on individual rows. Imagine that two users may query the same virtual table, but they are not allowed to see the same set of rows. For example, a manager may be allowed to see the data of all the customers, whereas an account manager may only see the customers for whom he is responsible. The effect of row-level privileges is that when two users retrieve data from the same virtual table, they see different sets of rows.
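A row-level privilege behaves like a filter predicate that is silently combined with every query a user runs against the virtual table. A small Python sketch under that assumption (the users and predicates are invented for illustration):

```python
# Hypothetical row-level privileges: a filter predicate per user.
ROW_FILTERS = {
    "manager": lambda row: True,                               # sees every customer
    "alice":   lambda row: row["account_manager"] == "alice",  # only her own
}

def query(user, rows):
    """Return only the rows the user's predicate allows."""
    allowed = ROW_FILTERS.get(user, lambda row: False)  # default: see nothing
    return [row for row in rows if allowed(row)]

customers = [
    {"name": "Jones", "account_manager": "alice"},
    {"name": "Smith", "account_manager": "bob"},
]
print(len(query("manager", customers)))  # 2
print(len(query("alice", customers)))    # 1
```

Both users issue the same query against the same virtual table, yet each sees a different set of rows.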

The most granular form of a privilege is a value-level privilege. This allows for defining privileges on individual values in a row. The effect is that some users have access to particular rows, but they won't see some of the values in those rows, or they only see a part of the values. Defining value-level privileges is sometimes referred to as masking.
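Masking can be pictured as a function applied to individual values before they are returned to a user who lacks the value-level privilege. A hypothetical Python sketch:

```python
# Hypothetical value-level privilege (masking): some users see only
# part of a sensitive value, e.g. the last four digits of a card number.
def mask_card(number: str) -> str:
    """Replace all but the last four characters with asterisks."""
    return "*" * (len(number) - 4) + number[-4:]

MASKED_USERS = {"support_agent"}  # users who only see masked values

def read_card(user, number):
    return mask_card(number) if user in MASKED_USERS else number

print(read_card("auditor", "4111111111111111"))        # 4111111111111111
print(read_card("support_agent", "4111111111111111"))  # ************1111
```

The rows themselves remain visible to both users; only the sensitive values within them differ.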

To summarize, data virtualization products offer a rich set of data security options. Besides the authorization rules described above, encryption is usually supported for messages that are being sent or received.

For more information on data security and for more general information on data virtualization I refer to my book "Data Virtualization for Business Intelligence Systems."

Note: If you have questions related to data virtualization, send them in. I am more than happy to answer them.

Fri, 25 Jan 2013 01:19:53 -0700