Blog: Rick van der Lans
http://www.b-eye-network.com/blogs/vanderlans/
Welcome to my blog where I will talk about a variety of topics related to data warehousing, business intelligence, application integration, and database technology. Currently my special interests include data virtualization, NoSQL technology, and service-oriented architectures. If there are any topics you'd like me to address, send them to me at rick@r20.nl.
Copyright 2013
Mon, 29 Apr 2013 01:16:01 -0700

NoSQL: A Challenge for Data Replication
Many of the NoSQL systems have built-in data replication features--data is automatically stored multiple times. In fact, the developers can set how many replicas have to be created. However, the replication features of NoSQL systems are limited to a homogeneous environment. It's not possible to use these features when, for example, data has to be replicated from a NoSQL system to a classic SQL system.

Today, most data replication products can't replicate from or to NoSQL systems. However, if they can in the future, what will be important is that they handle the non-relational concepts of NoSQL systems efficiently. The keyword here is efficiently. Most existing data replication tools have been designed and optimized to copy data between SQL systems. So, they have been optimized to efficiently process relatively short records with a fixed structure. However, NoSQL records are not always short and fixed with respect to structure. NoSQL systems support a wide range of concepts:

  • Many NoSQL systems, including the key-value stores, the document stores, and the column-family stores, support extremely long records. These records can be orders of magnitude longer than what is common in SQL systems. Current data replicators have been optimized to replicate short records.
  • Almost all NoSQL systems support tables in which each record can have a different structure. This is new for data replication products. For example, what will that do to compression algorithms that assume that all records have the same structure?
  • Document stores and column-family stores support hierarchical structures. If that type of data has to be replicated into SQL systems, it has to be flattened somehow. The challenge is to do that very fast. But can it be done fast enough? Data replicators are usually not strong at transformations, because transformations slow down the replication process too much.
  • Column-family stores support what the relational world used to call repeating groups. As with hierarchical structures, the question is how data replication tools can map them to relational structures efficiently.
There is no question about whether we need data replication technology to replicate between NoSQL and SQL systems. But the key question is whether it can do this efficiently. This is more than adding one extra source to their list of supported products. It requires a substantial redesign of the internals of these products. This is the challenge these vendors are confronted with in the coming years. Hopefully, they will not claim to support NoSQL, while in fact they only replicate data from NoSQL systems if that data has a relational form.
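To make the flattening problem concrete, here is a minimal sketch in Python of what a replicator would have to do for every single record. It is only an illustration under my own assumptions: the function name `flatten_document`, the `parent_id` and `position` columns, and the sample order document are all hypothetical, and nested structures inside a repeating group are not handled.

```python
from typing import Any


def flatten_document(doc: dict[str, Any], key_field: str = "id") -> dict[str, list[dict[str, Any]]]:
    """Split one nested document into rows for a parent table and child tables.

    Scalars stay in the parent row, nested dicts become prefixed columns,
    and lists (repeating groups) become rows in a child table that carry
    the parent's key as a foreign key.
    """
    parent: dict[str, Any] = {}
    tables: dict[str, list[dict[str, Any]]] = {}

    def add_fields(prefix: str, value: dict[str, Any]) -> None:
        for name, item in value.items():
            column = f"{prefix}{name}"
            if isinstance(item, dict):
                # Hierarchical structure: recurse and prefix the column names.
                add_fields(f"{column}_", item)
            elif isinstance(item, list):
                # Repeating group: one child row per element.
                rows = tables.setdefault(column, [])
                for position, element in enumerate(item):
                    row = {"parent_id": doc[key_field], "position": position}
                    if isinstance(element, dict):
                        row.update(element)
                    else:
                        row["value"] = element
                    rows.append(row)
            else:
                parent[column] = item

    add_fields("", doc)
    tables["parent"] = [parent]
    return tables


# Hypothetical document as it might come out of a document store.
order = {
    "id": 42,
    "customer": {"name": "Acme", "address": {"city": "Utrecht"}},
    "lines": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
}
tables = flatten_document(order)
```

Here one document turns into one parent row plus two child rows. Every record may have a different structure, so this work cannot be prepared once per table; it has to be repeated per record, which is exactly where the efficiency question comes in.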

Note: For more on this topic, see the whitepaper Empowering Operational Business Intelligence with Data Replication.

http://www.b-eye-network.com/blogs/vanderlans/archives/2013/04/nosql_a_challen.php Mon, 29 Apr 2013 01:16:01 -0700
Call for Speakers: BI & DW Conference London
Data Warehouse & Business Intelligence European Conference in London coming November? If you are, please fill in the call for speakers.

Previous editions were very successful and attracted more than 200 delegates. Evaluations showed that the attendees were very pleased with the selected speakers, the topics, and setup of the conference.

The 2013 edition is aimed at all aspects of data warehousing and business intelligence, including: trends, design guidelines, product overviews and comparisons, best practices, and new evolving technologies. And like the previous years, the conference is organized together with the highly successful European Data Management and Information Quality Conference.

With this year's call for presentations we are trying to attract proposals for sessions on traditional and future data warehousing and business intelligence aspects. Delegates have expressed a preference for case studies rather than theoretical or abstract topics. We would particularly like practitioners in the field to respond to this call for papers. We encourage new speakers to apply. Success stories, that is, case studies where data warehousing and business intelligence have produced real bottom-line benefits, are very much appreciated.

Example topics for proposals are:

  • Agile BI
  • Big data analytics
  • BI in the cloud
  • Data modelling for data warehouses
  • NoSQL in a data warehouse environment
  • The logical data warehouse
  • Data virtualization and data federation
  • The maturity of analytical database servers
  • Star schema, snowflake and data vault models
  • Selling business intelligence to the business
  • The relationship between master data management and data warehousing
  • Guidelines for using ETL tools
  • Operational BI and real-time analytics
  • BAM (Business Activity Monitoring) and KPI (Key Performance Indicators)
  • BI scorecards
  • Customer analytics and insight
  • Text mining and text analytics
  • Open source BI
  • Corporate Performance Management
I am looking forward to your proposal, and hope to see you in London this coming November.

Rick F. van der Lans
Chairman of the Data Warehouse & Business Intelligence European Conference 2013

http://www.b-eye-network.com/blogs/vanderlans/archives/2013/04/call_for_speake_1.php Mon, 15 Apr 2013 08:38:49 -0700
Ted Codd and Twelve Rules for Relational Databases

About two months ago, Pervasive Software asked me to write a whitepaper describing how well their popular PSQL database server supports Codd's twelve rules for relational databases.

For those not familiar with these rules, in 1985, E.F. (Ted) Codd, the founder of the relational model, defined a set of twelve rules for determining how well a database product supports the relational model. These rules make it possible to answer the question whether a particular product is a relational database server. They were urgently needed, because many vendors were labeling their products as relational, while they were not. So, the term relational became somewhat polluted and Codd wanted to fix and prevent this.

The study was a real trip down memory lane. It was a pleasure to reread all those articles and books written by Codd himself and those by Chris Date on, for example, updatable views. The work they did then was brilliant. So much of what they wrote is, after so many years, still very true.

After studying Pervasive PSQL in detail, my verdict is that it scores a 10 (on a scale of 0 to 12). Nine rules are fully supported, two partially, and two not. Therefore, the overall conclusion is that PSQL is 83% relationally complete. This is an excellent score and puts PSQL in the list of most relational products.

Is it possible to be 100% relational? The answer is yes. Such products can be developed. In fact, there is one open source product that supports most of the rules: Alphora's DataPhor. However, the product is not (yet) a commercial success. In the same year when Codd introduced the twelve rules, he also wrote "No existing DBMS product that I know of can honestly be claimed to be fully relational, at this time." It looks as if this statement still holds for all the SQL products and probably for most database servers.

Note: Now that Pervasive and Actian have merged, maybe I should write a comparable paper for their Ingres and Vectorwise database servers, and see which one is the most relational product.

http://www.b-eye-network.com/blogs/vanderlans/archives/2013/04/codds_twelve_ru.php Mon, 15 Apr 2013 07:40:19 -0700
Data Replication for Enabling Operational Business Intelligence
For more and more users operational business intelligence (OBI) is crucial. For example, consider operational management and external parties, such as customers, suppliers, and agents. If we give them access to data to support their decision making processes, in many cases, only operational data is relevant.

But how do we develop BI systems that show operational data? In PowerPoint we can draw an architecture in which BI reports directly access operational databases. And on that PowerPoint slide all seems to work fine. Not in real life, however. Running a BI workload on an operational database can lead to interference, performance degradation, performance instability, and so on. In other words, the operational environment is not going to enjoy this.

This is where data replication can come to the rescue. With data replication we can create and keep a replica of an operational database up to date without interfering with the operational processing. When new data is inserted, updated, or deleted in the original operational database, the replica is updated accordingly and almost instantaneously. This replicated database can then be used for operational reporting and analytics.
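As a rough illustration of that idea, the sketch below applies a stream of captured changes to a replica. It is a simplification under my own assumptions: the `Change` record, the `apply_changes` function, and the use of a plain dictionary as the replica are all hypothetical, and real products typically read such changes from the database log rather than from a Python list.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Change:
    """One captured change from the operational database (hypothetical format)."""
    op: str                                  # "insert", "update", or "delete"
    key: int                                 # primary key of the affected row
    row: Optional[dict[str, Any]] = None     # new or changed column values


def apply_changes(replica: dict[int, dict[str, Any]], changes: list[Change]) -> None:
    """Bring the replica up to date by replaying the changes in order."""
    for change in changes:
        if change.op == "insert":
            replica[change.key] = dict(change.row or {})
        elif change.op == "update":
            replica.setdefault(change.key, {}).update(change.row or {})
        elif change.op == "delete":
            replica.pop(change.key, None)
        else:
            raise ValueError(f"unknown operation: {change.op}")


replica = {1: {"name": "Ann", "city": "Delft"}}
apply_changes(
    replica,
    [
        Change("insert", 2, {"name": "Bob", "city": "Breda"}),
        Change("update", 1, {"city": "Leiden"}),
        Change("delete", 2),
    ],
)
```

The reports then query the replica, so the operational database only has to hand over its changes and is not burdened with the BI workload itself.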

Data replication as a technology has been around for a long time, but so far it has been used primarily to increase the availability and/or to distribute the workload of operational systems. My expectation is that data replication will be needed for implementing many new OBI systems. For these products to be ready for BI, besides supporting classic data replication features, such as minimal interference, high throughput, and high availability, they should also support the following three features that are important for BI:

  • Easy to use and easy to maintain: Until now, data replication has been used predominantly in IT departments, and not so much in BI departments or BI Competence Centers. So within these BI groups there is little expertise with data replication and little knowledge of how to embed that technology in BI architectures. Because of this unfamiliarity, it's important that these products are easy to install, easy to manage, and that replication specifications can be changed quickly and easily. A Spartan interface is not appreciated.
  • Heterogeneous data replication: In many organizations the database servers used in these operational environments are different from the ones deployed in their BI environments. Therefore, data replication tools should be able to move data between database servers of different brands.
  • Fast loading into analytical database servers: More and more analytical database servers, such as data warehouse appliances and in-memory database servers, are used to develop data warehouses and/or data marts. These database servers are amazingly fast in running queries. What we don't want is for data to be loaded into these products using simple SQL INSERT statements. It will work, but it will be slow. Almost all of these products have specialized interfaces or utilities for fast loading of data. It's vital that data replication products exploit these interfaces or utilities.
To summarize, because of OBI, the need for data replication will increase. It's important that organizations, when they evaluate this technology, study the three features above. For more information on this topic I refer to this whitepaper and this webinar.
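The difference between loading with simple INSERT statements and loading in bulk can be sketched as follows. Note the assumptions: I use Python's built-in `sqlite3` module purely as a stand-in, the `sales` table and both function names are hypothetical, and a single batched transaction is only an analogy for the specialized load interfaces of analytical database servers, not an example of one.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical target table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")
    return conn


def load_row_by_row(conn: sqlite3.Connection, rows: list[tuple[int, float]]) -> None:
    """The simple approach: one INSERT and one commit per replicated row."""
    for row in rows:
        conn.execute("INSERT INTO sales (id, amount) VALUES (?, ?)", row)
        conn.commit()


def load_in_bulk(conn: sqlite3.Connection, rows: list[tuple[int, float]]) -> None:
    """The batched approach: all rows handed over in one call, one transaction."""
    with conn:  # commits once at the end, rolls back on an exception
        conn.executemany("INSERT INTO sales (id, amount) VALUES (?, ?)", rows)


rows = [(i, i * 1.5) for i in range(1000)]
slow_db, fast_db = make_db(), make_db()
load_row_by_row(slow_db, rows)
load_in_bulk(fast_db, rows)
```

Both paths end up with the same data. What differs is the number of round trips and commits, and that overhead is what the specialized load utilities are designed to avoid.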

http://www.b-eye-network.com/blogs/vanderlans/archives/2013/03/data_replicatio.php Wed, 06 Mar 2013 20:29:16 -0700
The SQL-fication of NoSQL Continues
Adding SQL is a wise decision, because through SQL, (big) data stored in these systems becomes available to a much larger audience and therefore becomes more valuable to the business. It makes it possible to use a much broader set of products to query and analyze that data. Evidently, not all these SQL implementations are perfect today, but I don't doubt that they will improve over time.

Considering this SQL-fication that's going on, how much longer can we state that the term NoSQL stands for NO SQL? Maybe in a few years we will say that NoSQL stands for Not Originally SQL.

In a way, this transformation reminds me of the history of Ingres. This database server started out as a NoSQL product as well. In the beginning, Ingres supported a database language called Quel (a relational language, but not SQL). Eventually, the market forced them to convert to SQL. Not Originally SQL certainly applies to them.

Anyway, the SQL-fication of NoSQL products and big data has started and continues. To me, this is a great development, because more and more organizations understand what a major asset data is. Therefore, data, any data, big or small, should be stored in systems that can be accessed by as many tools, applications, and users as possible, and that's what SQL offers. Such a valuable asset should not be hidden and buried deep in systems that can only be accessed by experts and technical wizards.

http://www.b-eye-network.com/blogs/vanderlans/archives/2013/02/the_sql-ficatio.php Fri, 08 Feb 2013 08:53:24 -0700
Is Herman Hollerith the Grandfather of Big Data?
Halfway through the process, in 1884, it was evident that it would take a long time. Therefore, one of the employees of the US Census Office (USCO) was asked to design a machine that would speed up the process for the upcoming 1890 census. This machine had to make it possible to process the enormous amount of data much faster.

That employee was Herman Hollerith. In fact, William Hunt and Charles Pidgin were asked the same question. A benchmark was prepared in which all three could demonstrate how fast their solutions were. Coding took 144 hours for Hunt's method, 100 hours for Pidgin's method, and 72 hours for Hollerith's method. The processing of the data took 55, 44, and 5 hours, respectively. In conclusion, Hollerith's solution was the fastest by far and was, therefore, selected by the USCO.

For the 1890 census, 50,000 men were used to gather the data and to put it on punch cards. It was decided to store many more data attributes: 235 instead of the 6 used in the 1880 census. Hollerith also invented a machine for punching cards. This machine made it possible for one person to produce 700 punch cards per day. Because of Hollerith's machines, 6,000,000 persons could be counted per day. His machines reduced a ten-year job to a few months. In total, his inventions led to $5 million in savings.

Hollerith's ideas for automation of the census are described in Patent No. 395,782 of Jan. 8, 1889 which starts with the following sentence: "The herein described method of compiling statistics ..."

Does this all sound familiar? Massive amounts of data, compiling statistics, the need for better performance. To me it sounds as if Hollerith was working on the first generation of big data systems.

Hollerith started his own company in 1896, the Tabulating Machine Company. In 1911 it merged with some other companies into the Computing-Tabulating-Recording Company (CTR), and in 1924 the name CTR was changed to IBM. In other words, IBM has always been in the business of big data, analytics, and appliances.

Why did it take so long before we came up with the term big data while, evidently, we have been developing big data systems since the early beginnings of computing? You could say that the first information processing system was a big data system using analytics. This means that Hollerith, besides being a very successful inventor, can be considered the grandfather of big data.

http://www.b-eye-network.com/blogs/vanderlans/archives/2013/02/is_big_data_the.php Fri, 01 Feb 2013 07:00:45 -0700
Customer Question 11 on Data Virtualization - How Do I Protect Data?
The issue is the following. When a data virtualization server has been connected to many different data sources and when a user has access to that data virtualization server, potentially he has access to a vast amount of data. Should he be allowed to access all that data, or should certain pieces be hidden? For most organizations the answer is that users are not always allowed to access or change all that data.

All data virtualization servers support a form of data security we usually call authorization. Authorization rules can be defined to control which user is allowed to access which data elements. This is somewhat similar to assigning privileges to users with the GRANT statement in SQL. The following types of privileges are normally supported by data virtualization servers: read, write, execute, select, update, insert, and grant.

Privileges can be granted on the table level, the column level, the row level, and the individual value level. Table-level and column-level privileges are supported by all data virtualization servers. If a user receives a table-level privilege he can see or access all the data in that table. When the user only receives the privilege on a set of columns, some columns will stay hidden.

In some situations, authorization rules have to be defined on a more granular level, namely on individual rows. Imagine that two users may query the same virtual table, but they are not allowed to see the same set of rows. For example, a manager may be allowed to see the data of all the customers, whereas an account manager may only see the customers for whom he is responsible. The effect of row-level privileges is that if two users retrieve data from the same virtual table, they see different sets of rows.

The most granular form of a privilege is a value-level privilege. This allows for defining privileges on individual values in a row. The effect is that some users have access to particular rows, but they won't see some of the values in those rows, or they only see a part of the values. Defining value-level privileges is sometimes referred to as masking.
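The three finer-grained levels can be illustrated with a small sketch in Python. This is not how any particular data virtualization server implements authorization; the function `apply_privileges`, the `customers` table, and the masking rule are all hypothetical and only show how column-level, row-level, and value-level rules combine when a user queries a virtual table.

```python
from typing import Any, Callable

Row = dict[str, Any]


def apply_privileges(
    rows: list[Row],
    visible_columns: set[str],
    row_filter: Callable[[Row], bool],
    masks: dict[str, Callable[[Any], Any]],
) -> list[Row]:
    """Return what one user is allowed to see of a virtual table."""
    result = []
    for row in rows:
        if not row_filter(row):                  # row-level privilege
            continue
        visible = {}
        for column, value in row.items():
            if column not in visible_columns:    # column-level privilege
                continue
            mask = masks.get(column)             # value-level privilege (masking)
            visible[column] = mask(value) if mask else value
        result.append(visible)
    return result


def mask_phone(value: str) -> str:
    """Show only the last four characters of a value."""
    return "*" * (len(value) - 4) + value[-4:]


customers = [
    {"id": 1, "name": "Acme", "account_manager": "jan", "phone": "555-0101", "credit_limit": 5000},
    {"id": 2, "name": "Globex", "account_manager": "eva", "phone": "555-0199", "credit_limit": 9000},
]

# Account manager "jan" sees only his own customers, not the credit limit,
# and only part of the phone number.
seen_by_jan = apply_privileges(
    customers,
    visible_columns={"id", "name", "phone"},
    row_filter=lambda row: row["account_manager"] == "jan",
    masks={"phone": mask_phone},
)
```

A manager would get the same virtual table with a row filter that accepts every row and no masks, which is why two users querying the same table can see different results.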

To summarize, data virtualization products offer a rich set of data security options. Besides the mentioned authorization rules, encryption is usually supported for messages that are being sent or received, and so on.

For more information on data security and for more general information on data virtualization I refer to my book "Data Virtualization for Business Intelligence Systems."

Note: If you have questions related to data virtualization, send them in. I am more than happy to answer them.

http://www.b-eye-network.com/blogs/vanderlans/archives/2013/01/customer_questi_10.php Fri, 25 Jan 2013 01:19:53 -0700
Customer Question 10 on Data Virtualization - What About Updates and Transactions?
In most situations, data virtualization is deployed for accessing data stored in data warehouses, staging areas, and data marts. Data is read from those data sources by the data virtualization servers, but is not updated, inserted, or deleted. And because, currently, data virtualization servers are predominantly used in BI and reporting environments, some people may get the feeling that these products do not allow data to be updated, inserted, or deleted, or are not able to do so. This is not true.

Some of the data virtualization products were initially designed to be deployed in SOA environments where they could be used to simplify the development of transactions on databases and applications. So, although the focus of some vendors has shifted to BI environments, the ability to run transactions still exists. Most products allow data in data sources (even non-SQL-based sources) to be changed. The data virtualization server handles all the transactional aspects of those changes. These products even support distributed transactions: when data in multiple data sources is changed, those changes are treated as one atomic transaction. And they support heterogeneous distributed transactions, in which data in different types of data sources is changed.
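The idea of treating changes in multiple data sources as one unit can be sketched as follows. This is a simplified illustration under my own assumptions: two in-memory `sqlite3` databases stand in for two separate data sources, and the function `change_both` and the `orders` and `stock` tables are hypothetical. It is not a full two-phase commit, because a failure between the two final commits is not handled; real products use a prepare phase for that.

```python
import sqlite3


def change_both(orders: sqlite3.Connection, stock: sqlite3.Connection,
                order_id: int, sku: str, qty: int) -> bool:
    """Change two data sources as one unit: both commit or both roll back."""
    try:
        orders.execute(
            "INSERT INTO orders (id, sku, qty) VALUES (?, ?, ?)",
            (order_id, sku, qty),
        )
        cursor = stock.execute(
            "UPDATE stock SET on_hand = on_hand - ? WHERE sku = ? AND on_hand >= ?",
            (qty, sku, qty),
        )
        if cursor.rowcount != 1:
            raise ValueError("not enough stock")
    except Exception:
        # One source failed, so the change in the other is undone as well.
        orders.rollback()
        stock.rollback()
        return False
    orders.commit()
    stock.commit()
    return True


# Two separate databases standing in for two data sources.
orders_db = sqlite3.connect(":memory:")
orders_db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, sku TEXT, qty INTEGER)")
orders_db.commit()

stock_db = sqlite3.connect(":memory:")
stock_db.execute("CREATE TABLE stock (sku TEXT PRIMARY KEY, on_hand INTEGER)")
stock_db.execute("INSERT INTO stock VALUES ('A1', 5)")
stock_db.commit()

accepted = change_both(orders_db, stock_db, 1, "A1", 2)    # enough stock
rejected = change_both(orders_db, stock_db, 2, "A1", 10)   # not enough stock
```

In the second call the order row has already been inserted when the stock update fails, and the rollback removes it again. The application sees one transaction, regardless of how many data sources are involved.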

Evidently, data can only be changed and distributed transactions can only be executed on data sources that allow and support such functionality. For example, data can't always be changed if the underlying data source is a web service, a Word document, or an external data source. In such situations, data cannot be changed or you're just not allowed to change it.

To summarize, data virtualization servers allow data in the data sources to be changed, and they can guarantee the correct handling of transactions. This makes them suitable for, for example, creating a modular set of services that can be used by applications to change the data. These services hide where data is stored, whether the data has to be changed in multiple systems, how the data is stored, and so on. Note, though, that support for updates and transactions differs between data virtualization servers.

For more information on updates and transactions and for more general information on data virtualization I refer to my book "Data Virtualization for Business Intelligence Systems."

Note: If you have questions related to data virtualization, send them in. I am more than happy to answer them.

http://www.b-eye-network.com/blogs/vanderlans/archives/2013/01/customer_questi_9.php Wed, 23 Jan 2013 07:30:58 -0700
Quest BI: Cross-platform Querying, Reporting, Analytics, and Data Federation
Toad for Oracle and Toad for SQL Server have been and still are very popular among programmers. What many do not know is that Quest, now owned by Dell, has a very powerful set of business intelligence tools, including Toad Data Point and Toad Decision Point.

Toad Data Point is a cross-platform query and reporting tool designed for data analysts. It allows for visual query building and workflow automation. In addition, and this is a strong point of Quest's BI portfolio, it has very strong data connectivity capabilities. Besides being able to access many classic SQL databases through standard ODBC/JDBC drivers, it comes with a client version of Quest Data Services. Quest Data Services can be seen as a modern data federation product. It simplifies access to data sources and allows users to join data from multiple data sources, including data from Hadoop via Hive.
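For readers new to data federation, the essence of joining data from multiple data sources can be sketched in a few lines of Python. This says nothing about how Quest Data Services works internally; the function `federated_join` and both sample data sets are hypothetical, with plain lists of dictionaries standing in for, say, a SQL database and a Hadoop source.

```python
from typing import Any

Row = dict[str, Any]


def federated_join(left: list[Row], right: list[Row], key: str) -> list[Row]:
    """Inner join of rows from two different sources on a shared key."""
    # Build a lookup on the right-hand source first (a simple hash join).
    index: dict[Any, list[Row]] = {}
    for row in right:
        index.setdefault(row[key], []).append(row)

    joined = []
    for row in left:
        for match in index.get(row[key], []):
            joined.append({**match, **row})
    return joined


# Hypothetical rows from a SQL database ...
sql_customers = [
    {"customer_id": 1, "name": "Acme"},
    {"customer_id": 2, "name": "Globex"},
]
# ... and from a big data source.
page_views = [
    {"customer_id": 1, "page": "/pricing"},
    {"customer_id": 1, "page": "/docs"},
    {"customer_id": 3, "page": "/home"},
]

result = federated_join(sql_customers, page_views, "customer_id")
```

The user of a federation product only sees the joined result; where each row came from, and how it was fetched, stays hidden.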

Toad Decision Point is a cross-platform reporting, analytical, and visualization tool designed for decision makers. It's a highly graphical environment. It also comes with a client version of Quest Data Services. However, with this version of Quest Data Services a wider range of data sources can be accessed, including applications such as Salesforce.com, and NoSQL data sources such as Hadoop, MongoDB, and Cassandra. Some of these NoSQL databases support non-relational concepts, such as super-columns. Quest Data Services makes it possible to access those non-relational concepts in a relational style. This opens up all the valuable data stored in these systems to a large set of tools, reports, and users.

The third core product of Quest's BI stack is Toad Intelligence Central which is a server version of Quest Data Services (the one supported by Toad Decision Point). Because it runs on the server, it makes data source access much more efficient. Because Toad Intelligence Central is server-oriented, users and reports can share specifications. The IT department can enter and manage these centralized specifications making it easier for users to develop and maintain their reports.

In short, the Quest BI portfolio is definitely worth studying. Especially now that they have the backing of Dell itself, their product set could become a serious challenger to the established BI vendors.

http://www.b-eye-network.com/blogs/vanderlans/archives/2013/01/quest_bi_cross-.php Fri, 11 Jan 2013 08:28:34 -0700
Customer Question 9 on Data Virtualization - Have We Developed Rube Goldberg Machines?
For those not familiar with the concept or name, a Rube Goldberg machine, contraption, invention, device, or apparatus is a deliberately over-engineered or overdone machine that performs a very simple task in a very complex fashion, usually including a chain reaction--this is the definition used by Wikipedia. Examples of very simple tasks are pouring beer into a glass, opening a door, or switching on a TV. Rube Goldberg was an American cartoonist who was best known for drawing weird machines. Here you can find a photo showing an example of a Rube Goldberg machine. On YouTube you can find numerous films showing such machines at work. One you have to see is the one developed by a young kid called Audri.

Why a discussion on these weird and often useless machines? Quite recently, I received an email from my customer with the following remark: "I have been thinking about the complexities of physical integration of our systems [with physical integration he means using classic ETL and duplicating data in several databases]. I wish I had Audri's YouTube video when I was trying to urge my team to consider data virtualization. After seeing that little boy, it feels as if developing and testing a system based on physical integration, is like trying to develop and test a Rube Goldberg machine."

He continues with "Knowing what I know now [after studying data virtualization more seriously], if data virtualization is comparable to using a remote control to turn a TV on and off, then physical integration is comparable to developing and using a Rube Goldberg machine to turn a TV on and off."

Evidently, this is an exaggeration, because there are still various situations for which you have to or want to deploy a form of physical integration. But there is some truth in it. Software and hardware are currently so much more powerful than ten years ago. In fact, there is so much more "power" available that if organizations had to design their current BI systems from scratch, they would probably come up with much simpler architectures, ones in which agility would be a fundamental design factor. Data virtualization would be one of the technologies that would clearly help to develop more agile BI systems.

So, years ago, when we designed the architectures of our BI systems, they were not considered Rube Goldberg machines. They were necessities; there was no other choice. But today there is. So, if we look at these architectures today, they do resemble Rube Goldberg machines. They are like machines in which the data values roll down a spiral, are thrown from one database to another, are changed occasionally, fall off some track sporadically, and sometimes even float a few inches, before they arrive in a report.

I have decided to use Audri's film from now on to explain what the differences are between developing BI systems with and without data virtualization.

Note: If you have questions related to data virtualization, send them in. I am more than happy to answer them.
http://www.b-eye-network.com/blogs/vanderlans/archives/2012/11/customer_questi_8.php Mon, 19 Nov 2012 11:15:22 -0700
Customer Question 8 on Data Virtualization - Is It Immature Technology?
Probably because the term data virtualization is relatively young, some think it's also young technology and thus immature and only usable in small environments. This is a misunderstanding. Therefore, I decided to give a feel for the long and rich history of data virtualization, making use of extracts from my book "Data Virtualization for Business Intelligence Systems."

Fundamental to data virtualization are the concepts of abstraction and encapsulation. These concepts have their origin in the early 1970s. Exactly forty years ago, in 1972, David L. Parnas wrote a groundbreaking article "On the Criteria to be Used in Decomposing Systems into Modules." In this to me legendary article, Parnas explains how important it is that applications are developed in such a way that they become independent of the structure of the stored data. The big advantage of this concept is that if one changes, the other may not have to change. In addition, by hiding technical details, applications become easier to maintain, or to use more modern terms, they become more agile. Parnas called this information hiding and worded it as follows: "... the purpose of [information] hiding is to make inaccessible certain details that should not affect other parts of a system."

Information hiding eventually became the basis for popular concepts such as object orientation, component-based development, and, more recently, service-oriented architectures. All three have encapsulation and abstraction as their foundation. No one questions the value of those three concepts anymore.

But Parnas was not the only one who saw the value of encapsulation and abstraction. The most influential paper in the history of data management, "A Relational Model of Data for Large Shared Data Banks", written by E.F. Codd, founder of the relational model, started as follows: "Future users of large data banks must be protected from having to know how the data is organized [...] application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed." He used different terms than Parnas, but he clearly had the same vision. This illustrates one fundamental principle of computer science which is at the root of data virtualization: applications should be independent of the complexities of accessing data.
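The principle is easy to show in code. The sketch below is my own illustration, not something taken from Parnas or Codd: the names `CustomerSource`, `TupleStore`, `DocumentStore`, and `greeting` are hypothetical. The application function depends only on an interface, so the internal representation of the data can change without the application changing.

```python
from typing import Protocol


class CustomerSource(Protocol):
    """The only thing the application is allowed to know about the data."""

    def customer_name(self, customer_id: int) -> str: ...


class TupleStore:
    """Keeps customers as flat (id, name) tuples, like rows in a table."""

    def __init__(self) -> None:
        self._rows = [(1, "Acme"), (2, "Globex")]

    def customer_name(self, customer_id: int) -> str:
        for row_id, name in self._rows:
            if row_id == customer_id:
                return name
        raise KeyError(customer_id)


class DocumentStore:
    """Keeps the same customers as nested documents."""

    def __init__(self) -> None:
        self._docs = {
            1: {"profile": {"name": "Acme"}},
            2: {"profile": {"name": "Globex"}},
        }

    def customer_name(self, customer_id: int) -> str:
        return self._docs[customer_id]["profile"]["name"]


def greeting(source: CustomerSource, customer_id: int) -> str:
    """Application code: unaffected by how the data is organized."""
    return f"Dear {source.customer_name(customer_id)}"


from_tuples = greeting(TupleStore(), 1)
from_documents = greeting(DocumentStore(), 1)
```

Swapping one store for the other requires no change to `greeting`. A data virtualization server plays the role of the interface here, but for all applications and all data sources at once.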

At the end of the 1970s, a concept called the three schema approach (or concept) was introduced and thoroughly researched. G.M. (Sjir) Nijssen was one of the driving forces behind most of the research in this area. Nijssen wrote numerous articles on this topic. Again, abstraction and encapsulation were the driving forces.

A personal note: In 1979 I started my IT career working for Nijssen in Brussels, Belgium, when all that research was going on. I didn't realize it at that time, but obviously data virtualization has played a role in my career from day one.

Technologically, data virtualization owes a lot to distributed database technology and federation servers. Most of the initial research for data federation was done by IBM in their famous System R* project which started way back in 1979. Another project that contributed heavily to distributed queries, was the Ingres project which eventually led to the open source SQL database server called Ingres, now distributed by Actian Corporation. System R* was a follow-up project to IBM's System R project--the birth place of SQL. Eventually, System R led to the development of most of IBM's commercial SQL database servers, including SQL/DS and DB2.

The forerunners of data virtualization servers can not be omitted here. The first products that deserve the label data federation server are IBM's DataJoiner and Information Builder's EDA/SQL (Enterprise Data Access). The former was introduced in the early 1990s and the latter in 1991. Both were not database servers, but were primarily products for integrating data from different data sources. Besides being able to access most SQL database servers, they were the first products to provide a SQL interface to non-SQL databases. Both products have matured and have undergone several name changes. After being part of IBM DB2 Information Integrator, DataJoiner is currently called IBM InfoSphere Federation Server, and EDA/SQL has been renamed into iWay Data Hub and is part of Information Builders' Enterprise Information Integration Suite.

I could list more technologies, research projects, and products that have been fundamental to the development of data virtualization, but I will stop here. This already impressive list clearly shows the long history and the serious amount of research that has gone into data virtualization and its forerunners. So, maybe the term data virtualization is young, but the technology definitely isn't. Therefore, classifying data virtualization as young and immature would not be accurate.

Note: If you have questions related to data virtualization, send them in. I am more than happy to answer them.

]]>
http://www.b-eye-network.com/blogs/vanderlans/archives/2012/11/customer_questi_7.php http://www.b-eye-network.com/blogs/vanderlans/archives/2012/11/customer_questi_7.php Mon, 12 Nov 2012 07:50:20 -0700
Redefining Big Data
BIG data stands for Business Intelligence Generated data. The reason for this proposal is that many are struggling with the term big data, myself included. There is a lot of confusion, because there is no generally accepted definition. We all know it's about large quantities of data, high velocity data, and/or a wide variety of data. But then still, what is a large quantity? When is it high and when low velocity? For some, big data is highly structured sensor data (machine generated data), for others it's textual unstructured data coming from social media, and there are those who say it's semi-structured data stored in, for example, weblogs.

The fact that the word big is a relative quantity doesn't help either. What is big for a midsize European company can be medium for a large US company. And is it really about the amount of data? Or is it more about what we do with it, for example, that we analyze the data (regardless of the quantity)? The V's (Volume, Velocity, Variety, Variation, Visibility, and Value--I've lost count of how many V's there are) are mentioned regularly to describe when something qualifies as big data.

Some have presented definitions, but I haven't seen an acceptable one yet. One author used the following definition: big data is data that is too much for a SQL database. This makes no sense. For example, there are plenty of multi-terabyte systems that everyone would classify as big data systems and that can be handled by SQL products more than satisfactorily.

Lastly, enough data is enough data. The quality of an analytical result doesn't always increase when the amount of data increases. Data quality is often more important than data quantity.

Conclusion: confusion rules when it comes to the concept of big data.

In this blog I look at big data systems from a different angle in the hope that this helps to clarify this muddled concept.

Undeniably, processing large quantities of data is a common characteristic of most big data systems, but there is another one: most of these systems combine characteristics of production systems and of BI systems. In a sense, each big data system is a production system, because it collects and stores new data. It is also a BI system, because this new data is not collected to support business processes; the primary intention is to use it for some form of analytics, possibly embedded analytics (analytics embedded within production systems), operational analytics, or predictive analytics. By new data I mean data that the organization does not yet collect and store, and in many cases it's also a new type of data. For example, a big data system developed by a retail company may be gathering camera data for tracking customer routes through their stores. Or, a big data system of a large international electronics firm may collect unstructured social media data for sentiment analysis.

Traditionally, new data is entered with and processed by production systems, such as a general ledger, cash management, and claim processing systems. These systems are, however, not designed to support analytics, but are designed to support business processes. In fact, when they were designed, the focus was definitely not on analytics, but on supporting data entry. This is why it's sometimes so hard when developing BI systems to extract the right data from those production databases for analytical and reporting purposes--staging areas have to be developed, ETL and replication processes have to be designed, and so on. This is still true today: the designers of new production systems don't think about how the organization can use the data for analytical purposes.

In other words, big data systems are hybrid systems: they are production systems and BI systems at the same time. In my opinion, this is what makes big data applications special--and, evidently, most of them collect massive amounts of data to support the required forms of analytics.

So, maybe we should redefine the term big data. Let's begin by not associating the word big with a relative quantity anymore, but let's change the word big to an acronym, so that BIG data stands for Business Intelligence Generated data--data generated and stored with the primary purpose to analyze it. Thus, a big data system is a system that generates, collects, stores, and processes data specifically to support business intelligence. Subsequently, big data is data managed by a big data system.

Hopefully, redefining the term big data makes it more obvious what is meant by this promising category of systems and gets rid of some of the confusion.

]]>
http://www.b-eye-network.com/blogs/vanderlans/archives/2012/10/redefining_big.php http://www.b-eye-network.com/blogs/vanderlans/archives/2012/10/redefining_big.php Tue, 16 Oct 2012 13:45:02 -0700
Customer Question 7 on Data Virtualization - What About Performance?
Their question: "Isn't data virtualization by definition slow, because it's an extra layer of software? And doesn't all the federation, integration, transformation, and cleansing of data that has to take place on-demand, slow down each query?"

This is a question I can't disregard in this series, because performance is an aspect that always worries people when they hear about data virtualization for the first time. In addition, I received a comment on the first blog in this series, which was related to performance.

There is much to say about the performance of data virtualization servers, but because this is a blog, I focus on the key issues.

First of all, some think that the performance of a data virtualization server is by definition poor, because it's accessing source production systems, and not a data warehouse, a data mart, or some other database that is designed and optimized for reporting. It's true that retrieving data from source systems can lead to performance problems. These systems may not have been designed or optimized to run BI queries, or the transaction workload they have to process may be so intense that running queries on them as well can cause serious interference. Therefore, in most cases this is not the recommended approach. A better approach is to design a data warehouse and let the data virtualization server access that data warehouse and not the production systems. Data virtualization does not exclude a data warehouse; also see my blog Do We Still Need A Data Warehouse?

Second, because data virtualization evolved from data federation, some think that data virtualization is only worthwhile when data from multiple data sources is retrieved and integrated--only useful when data is federated. Because data federation can be a resource hungry operation, it can therefore be slow. Evidently, all data virtualization servers do support data federation, and they have various techniques to optimize this federation process. But data virtualization servers are not only useful when data federation is needed. In many systems, data virtualization is used even when each query is a non-federated query. In this case, the strength of data virtualization is encapsulation (hiding all irrelevant technical details of the data stores) and abstraction (showing only relevant data, with the right structure, and on the right aggregation level).

Third, to make access to data sources as efficient as possible, many optimization techniques are implemented in data virtualization servers. Examples are join optimization techniques, such as ship joins, query substitution, query pushdown, and query expansion. These techniques are all very mature and have proven their worth. In fact, research has been going on in this area since the days of IBM's famous System R* project and the Ingres project. Both projects started way back in the 1970s. And research continues--new techniques to optimize data access are still being discovered.
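To make one of these techniques concrete, here is a minimal sketch of the idea behind query pushdown. It is an illustration under assumptions, not how any particular data virtualization server implements it: an in-memory SQLite database stands in for a remote data source, and the table `orders` and the functions `without_pushdown` and `with_pushdown` are hypothetical names invented for this example. The point is only that sending the filter to the source reduces the number of rows that have to be transferred.

```python
import sqlite3

# An in-memory SQLite database stands in for a remote data source (assumption).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, country TEXT, amount INTEGER)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, "NL" if i % 10 == 0 else "US", i) for i in range(1, 101)],
)


def without_pushdown(country):
    """Fetch every row from the source and filter in the virtualization layer."""
    transferred = source.execute(
        "SELECT id, country, amount FROM orders"
    ).fetchall()
    result = [row for row in transferred if row[1] == country]
    return result, len(transferred)


def with_pushdown(country):
    """Send the filter to the source so only matching rows are transferred."""
    transferred = source.execute(
        "SELECT id, country, amount FROM orders WHERE country = ?",
        (country,),
    ).fetchall()
    return transferred, len(transferred)


slow_result, slow_rows = without_pushdown("NL")
fast_result, fast_rows = with_pushdown("NL")
print(f"rows transferred without pushdown: {slow_rows}, with pushdown: {fast_rows}")
```

Both functions return the same rows; the difference is how much data leaves the source before the filter is applied.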

Another example of a technique that improves performance is caching. With this technique, the contents of virtual tables (the key building blocks of data virtualization servers) are stored. This means that when a virtual table is accessed, the result is not retrieved from the underlying data sources, but from a cache. The effect is that access to the data sources is not required, data transformation doesn't have to take place, and no data has to be integrated or cleansed. No, the data is ready to go. It's like picking up a pizza from a take-away restaurant where you phoned in your order 30 minutes beforehand.
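The caching idea can be sketched in a few lines. This is a simplified illustration, not the design of an actual product: the class `VirtualTable`, its methods, and the function `load_customers` are hypothetical names, and the tiny cleansing step merely stands in for the federation, transformation, and cleansing work that a real server would do.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]


class VirtualTable:
    """Hypothetical virtual table that can serve its contents from a cache."""

    def __init__(self, load_from_sources: Callable[[], List[Row]]) -> None:
        self._load_from_sources = load_from_sources
        self._cache: Optional[List[Row]] = None
        self.source_accesses = 0  # how often the underlying sources were hit

    def _fetch(self) -> List[Row]:
        self.source_accesses += 1
        return self._load_from_sources()

    def refresh_cache(self) -> None:
        """Run the expensive retrieval once and keep the result."""
        self._cache = self._fetch()

    def rows(self) -> List[Row]:
        """Return cached rows if a cache exists, otherwise go to the sources."""
        if self._cache is not None:
            return list(self._cache)
        return self._fetch()


def load_customers() -> List[Row]:
    # Stand-in for integration and cleansing across data sources (assumption).
    raw = [{"id": 1, "city": " new york "}, {"id": 2, "city": "BOSTON"}]
    return [{"id": r["id"], "city": str(r["city"]).strip().title()} for r in raw]


customers = VirtualTable(load_customers)
customers.refresh_cache()   # the "phone call" to the restaurant
first = customers.rows()    # served from the cache
second = customers.rows()   # served from the cache again
```

After the cache is filled, repeated queries no longer touch the sources; the trade-off, which this sketch ignores, is deciding when the cache has to be refreshed.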

More and more data virtualization servers offer all kinds of features to store and access caches efficiently. For example, some allow the cache to be stored in the fastest analytical database servers available. It's to be expected that in the near future, data virtualization servers will also support in-memory database servers for storing caches. Undeniably, this will speed up query processing even more, because accessing those caches will involve no I/O.

To summarize, some have worries about the performance of data virtualization servers, but quite regularly those worries are based on the wrong assumptions, such as the assumption that data virtualization excludes a data warehouse. Data virtualization servers offer enough optimization techniques to process the majority of today's queries fast. However, if you want to run queries that join historical data stored in a data warehouse with data coming from a sentiment analysis executed on textual information straight from Twitter, and join that with production data that still has to be cleansed heavily, then yes, you will have a performance challenge.

Note: For more information on data virtualization, such as query optimization and caching, I refer to my new book "Data Virtualization for Business Intelligence Systems" available from Amazon.

]]>
http://www.b-eye-network.com/blogs/vanderlans/archives/2012/10/customer_questi_6.php http://www.b-eye-network.com/blogs/vanderlans/archives/2012/10/customer_questi_6.php Mon, 15 Oct 2012 01:24:16 -0700
Customer Question 6 on Data Virtualization - Do We Still Need A Data Warehouse?
Question: "If we adopt data virtualization, can we throw away the data warehouse, because we can access the data in the production databases straight on, right?"

Wrong! Data virtualization is not some data warehouse killer. In most projects where data virtualization is deployed, you will still need a data warehouse. In many systems, if no data warehouse is developed, it won't be possible to meet the information needs of many reports. Let me give the two key reasons:

  • Most production systems do not contain historical data. They were not designed to keep track of historical data. If a value is changed, the old value is deleted. For reports that need to do trend analysis, those deleted values may be needed. Thus, those values have to be stored somewhere. And this is where the data warehouse comes in: data warehouses are needed to store historical data.
  • Production systems may contain inconsistent data. One system may say that a customer is based in New York, while the other system indicates that he is based in Boston. Inconsistencies can't always be solved using software, sometimes human intervention is required to indicate what the correct value is. The result of that intervention must be stored somewhere, so that it can be reused. Again, that's where a data warehouse comes in.
And there are more reasons why an additional database, the data warehouse, is needed. If that data warehouse did not exist and the data virtualization server were connected directly to the production systems, it could not retrieve the historical data, because that data wouldn't exist, and it would not know which of the inconsistent values is the right one.
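The first reason can be illustrated with a small sketch. It is a simplified example under assumptions: one in-memory SQLite database stands in for both the production system and the data warehouse, and the table names `production_customer` and `warehouse_customer_history` and the function `set_city` are hypothetical. The production table overwrites the old value, while the warehouse table appends a new row for every change.

```python
import sqlite3

# One in-memory database stands in for two separate systems (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE production_customer (id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE warehouse_customer_history (id INTEGER, city TEXT, valid_from TEXT);
    """
)


def set_city(customer_id, city, as_of):
    """Record a change of city in both systems."""
    # Production system: the old value is simply replaced.
    conn.execute(
        "INSERT OR REPLACE INTO production_customer (id, city) VALUES (?, ?)",
        (customer_id, city),
    )
    # Data warehouse: every value is kept, with the date it became valid.
    conn.execute(
        "INSERT INTO warehouse_customer_history VALUES (?, ?, ?)",
        (customer_id, city, as_of),
    )


set_city(1, "New York", "2012-01-01")
set_city(1, "Boston", "2012-06-01")

production_rows = conn.execute(
    "SELECT city FROM production_customer WHERE id = 1"
).fetchall()
history_rows = conn.execute(
    "SELECT city, valid_from FROM warehouse_customer_history "
    "WHERE id = 1 ORDER BY valid_from"
).fetchall()
```

A trend report that needs to know where the customer was based in early 2012 can only be answered from the history table; the production table has already lost that value.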

It is worth mentioning that if a data warehouse system consists of a data warehouse and deploys data virtualization, then (physical) data marts may not be needed anymore when they contain data derived from the data warehouse. Such data marts can be simulated by the data virtualization server. We usually refer to them as virtual data marts.

So, introducing data virtualization in a data warehouse system does not imply throwing away the data warehouse. The data warehouse is still needed.

Note: For more information on data virtualization, I refer to my new book "Data Virtualization for Business Intelligence Systems" available from Amazon.

]]>
http://www.b-eye-network.com/blogs/vanderlans/archives/2012/09/customer_questi_5.php http://www.b-eye-network.com/blogs/vanderlans/archives/2012/09/customer_questi_5.php Thu, 27 Sep 2012 12:04:19 -0700
Data Virtualization is About Productivity and Agility
The business need to improve development speed and to increase agility is probably the reason why concepts such as agile BI and self-service BI have received so much attention lately. The Data Warehousing Institute recently identified five factors driving businesses toward self-service BI (2011 TDWI BI Benchmark Report: Organizational and Performance Metrics for Business Intelligence Teams), of which factors 1, 2, and 4 relate directly to productivity and agility:

  1. Constantly changing business needs (65%)
  2. IT's inability to satisfy new requests in a timely manner (57%)
  3. The need to be a more analytics-driven organization (54%)
  4. Slow and untimely access to information (47%)
  5. Business user dissatisfaction with IT-delivered BI capabilities (34%)
Many vendors of data virtualization servers claim that deploying their products leads to an increase in productivity and an improvement in agility in BI projects. But how? I will give two dominant reasons.

First, with data virtualization, reports are decoupled from the data sources. The effect is that it becomes easier to change the data storage solution without having to change the reports (or vice versa). For example, a generic SQL database server can be replaced by a faster analytical database server without having to change one line of code in the reports. Another example where decoupling is useful is when a reporting query is redirected away from a database that contains derived data to one that contains original data. This means that the derived data can be removed thus simplifying the entire system. Such a change is transparent to the reports. Simplification is (almost) always good for productivity and agility.
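The decoupling described above can be sketched with an ordinary database view playing the role of a virtual table. This is only an illustration under assumptions: two tables in one in-memory SQLite database stand in for a generic SQL database server and a faster analytical database server, and the names `sales_generic`, `sales_analytical`, `v_sales`, and `run_report` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Two tables stand in for two different database servers (assumption).
    CREATE TABLE sales_generic (region TEXT, amount INTEGER);
    CREATE TABLE sales_analytical (region TEXT, amount INTEGER);
    INSERT INTO sales_generic VALUES ('EU', 100), ('US', 250);
    INSERT INTO sales_analytical VALUES ('EU', 100), ('US', 250);

    -- The report only knows this view, not the table behind it.
    CREATE VIEW v_sales AS SELECT region, amount FROM sales_generic;
    """
)

# The report's query never changes.
REPORT_QUERY = (
    "SELECT region, SUM(amount) FROM v_sales GROUP BY region ORDER BY region"
)


def run_report():
    return conn.execute(REPORT_QUERY).fetchall()


before = run_report()

# Swap the storage behind the view; the report code is untouched.
conn.executescript(
    """
    DROP VIEW v_sales;
    CREATE VIEW v_sales AS SELECT region, amount FROM sales_analytical;
    """
)

after = run_report()
```

The report returns the same result before and after the swap, and `REPORT_QUERY` is not edited at any point, which is the sense in which the change is transparent to the reports.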

Second, if users need access to data they haven't used yet, making that data available to them using data virtualization is usually very easy, even if it concerns a NoSQL product or unstructured data. Maybe that first implementation is slow, but it works, and users can deploy it and may even experience an ROI. In parallel with the use of this first implementation, the IT department can look for a more efficient implementation (if required). Again, making changes afterwards has no impact on the reports.

More reasons exist that explain why data virtualization makes BI systems more agile and productive, but I think these two are the main ones. For more on how DV increases productivity and agility, I refer to the webinar series entitled Agile Data Integration for BI Lecture Series.

]]>
http://www.b-eye-network.com/blogs/vanderlans/archives/2012/09/data_virtualiza_3.php http://www.b-eye-network.com/blogs/vanderlans/archives/2012/09/data_virtualiza_3.php Tue, 25 Sep 2012 02:49:18 -0700