Almost 30 years ago, in 1984, John Gage1 of Sun Microsystems (acquired by Oracle in 2010) coined the phrase "The Network is the Computer." He was right then, and he is even more right today. Nowadays, application processing is highly distributed over countless machines connected by a network. The boundaries between computers have completely blurred. We run applications that seamlessly invoke application logic on other machines.
But it's not only application processing that is scattered across many computers; the same can be said for data. More and more digitized data is entered, collected and stored in a distributed fashion. It's stored in cloud applications, in outsourced ERP systems, on remote websites and so on. In addition, external data is available from governments, social media and news websites, and the number of valuable open data sources is staggering. The network is not only the computer anymore; the network has become the database as well:
The network is the database.
This dispersion of data is a fact. Still, data has to be integrated to become valuable for an organization. For a long time, the traditional solution for data integration has been to copy the data to a centralized site, such as the data warehouse. However, data volumes are increasing (and not only because of the popularity of big data systems). The consequence is that, more and more often, data has become too big to move (for performance, latency or financial reasons); data has to stay where it's entered. For integration, instead of moving the data to the query processing (as in data warehouse systems), query processing must be moved to the data sources.
This article explains the problem of centralized consolidation of data and describes how data virtualization helps to turn the network into a database using on-demand integration. It also explains the importance of distributed data virtualization in operating efficiently in today's highly networked environment.
A Short History Lesson
Once upon a time, all the digitized data of an enterprise was stored on a small number of disks managed by a few machines, all standing in the same computer room. Specialists in white coats monitored these machines and were responsible for making backups of the valuable data. It's very likely that all the users were in the same building as well, accessing the data through monochrome monitors. The network that was used to move data between the machines was referred to as the sneakernet.
Then the time came when users started to roam the planet, and machines residing in different buildings were connected with real networks. Compared to today, these first generations of networks were just plain slow. For example, in the 1970s, Bob Metcalfe (co-inventor of Ethernet) built a high-speed network interface between MIT and ARPANET.2
This network supported a dazzling network bandwidth of 100 Kbps. Compare that with today's 100 Gigabit Ethernet, which offers a million times more bandwidth. In an optimized network environment, one terabyte of data can now be transferred within 80 seconds. This would have taken about 2.5 years in the 1970s.
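The transfer times above follow directly from the link speeds. A back-of-the-envelope check, assuming ideal links running at their full nominal bandwidth with no protocol overhead:

```python
# Back-of-the-envelope check of the transfer times mentioned above,
# assuming the links run at their full nominal bandwidth.

TERABYTE_BITS = 1e12 * 8          # one terabyte expressed in bits

def transfer_seconds(bits: float, bandwidth_bps: float) -> float:
    """Time to move `bits` over a link of `bandwidth_bps` bits per second."""
    return bits / bandwidth_bps

ethernet_100g = transfer_seconds(TERABYTE_BITS, 100e9)  # 100 Gbps Ethernet
arpanet_100k = transfer_seconds(TERABYTE_BITS, 100e3)   # 100 Kbps, 1970s link

print(f"100 Gbps: {ethernet_100g:.0f} seconds")                   # 80 seconds
print(f"100 Kbps: {arpanet_100k / (3600 * 24 * 365):.1f} years")  # ~2.5 years
```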
Because users were working on remote sites, accessing data involved transmitting data back and forth, and that was slow. The vendors of database servers tried to solve this problem by developing distributed database servers
in the 1980s.3,4
By applying replication
techniques, data was moved closer to the users to minimize network delay. With replication, data is copied to the nodes on the network where users are requesting data. To keep replicas up to date, distributed database servers support complex and innovative replication mechanisms.
Nowadays, it's no longer just the computer room where new data is entered. Data is entered, collected and stored everywhere. Examples include:
- Distributed collection: Websites running in the cloud collect millions of weblog records indicating visitor behavior. Factories operating worldwide run high-tech machines generating massive amounts of sensor data. Mobile devices collect data on application usage and track geographical locations.
- Cloud applications: Applications such as Salesforce.com and NetSuite store enterprise data in the cloud.
- Open data: Thousands and thousands of open data sources have become available to the public. Open data sources contain weather data, demographic data, energy consumption data, hospital performance data, public transport data, and the list goes on. Almost all these open data sources are stored somewhere in the cloud.
- Outsourcing servers: The fact that many organizations run their own ERP applications and databases in the cloud has also led to a distribution of enterprise data. In fact, some organizations no longer have any idea where their data is physically stored.
- Personal data: Data created by individual users or small groups of users is stored far and wide. It's available on their own machines, on mobile devices and in services such as Dropbox or Google Drive.
But it's not only that data is stored in a distributed fashion; data entry is distributed as well. Employees, customers and suppliers all enter data via the Internet, using their own machines at home, on their mobile devices and so on. Data entry has never been more dispersed.
To summarize this short history lesson, in the beginning data and users were centralized. Next, data stayed centralized, and users became distributed. Now data and users are both highly distributed.
The Need to Integrate Distributed Data Remains
As described, there are many good reasons why data entry and data storage are dispersed. Still, data has to be integrated, and for many different reasons:
- Customer care can be improved when sales data is integrated with complaint data and data from social media (three different data sources).
- Transport planning can be more efficient when internal packing and delivery data is integrated with weather and traffic data (both external data sources).
- Internal sales data becomes more valuable to an organization when it's integrated with, for example, demographic data. It may explain why certain customers buy certain products. The sales data may be stored in an ERP system running on local servers, whereas the demographic data is available from an external website whose physical location is completely unknown.
- To develop the right key performance indicators (KPIs), sales data must be integrated with manufacturing data.
- Another reason for integrating data is purely to be able to make sense of data. For example, sensor data coming out of high-tech machines can be highly cryptic and coded. The explanations of these codes may be stored in a database residing on another system. So, to make sense of the sensor data, it must be integrated.
It is obvious that integrating data from different sources is important for every organization. And now that data is distributed, data integration becomes an even bigger technological challenge.
Is Centralization the Answer to Data Integration?
For the last twenty years, the most popular solution for integrating data has been the data warehouse. In most data warehouse systems, data from multiple sources is physically moved to, and consolidated in, one big database (one site). Here, the data is integrated, standardized and cleansed, and made available for reporting and analytics.
This centralization and consolidation of data makes a lot of sense from the perspective of the need to integrate data. And if there is not too much data, it's technically feasible. But can we keep doing this? Can we keep moving and copying data, especially in this era of big data? It looks as if the answer is going to be no, and for some organizations it's already a no. Here we list four problems of this approach:
- The first problem of this approach is the ever-growing amount of data. There is a reason why big data is the biggest trend in the IT industry. The word "big" says it all. Big data is about managing, storing and analyzing massive amounts of data. And sometimes big data can be too big to move. For some organizations, the amount of data generated each day is more than can be moved across the network (depending on the network characteristics). In such a situation, when data is moved to a central site for integration purposes, the network cables will start to look like snakes swallowing pigs.
- The second problem is caused by the growing importance of operational intelligence. Users want to work with zero-latency data (data that is 100%, or close to 100%, up to date). If data is first transmitted in large batches over the network and stored redundantly, there will always be a delay. When users demand operational intelligence, it's better to request data straight from the source.
- The third problem is related to privacy. More and more international legislation exists for storing data on individuals, such as customers, patients and visitors (see end note #5 for an example of such legislation). For example, storing data about individuals who are not aware of it and haven't given permission to do so is illegal. These rules are becoming tighter and tighter, and rightfully so. This implies that when an organization needs access to demographic data on individual customers, it can't just copy and store that data in its own systems for integration purposes. This data must be used where it's stored.
- The fourth problem is cost. Consolidating big data in traditional SQL database servers is starting to become too expensive.
Until now, centralization may have been the right approach for data integration, but as more data is entered and stored in a distributed fashion, it may not be the right solution in the near future. In the 1980s, distributed database technology moved data to the user, and for integration purposes data was moved to the point of query processing. It's now time to move the query processing to the location where the data is collected. This minimizes network traffic and duplication of stored data, and it lowers the risk that data will be inconsistent (or just plain incorrect) and/or out of date. If the mountain will not come to Mahomet, Mahomet must go to the mountain.
Data Virtualization to the Rescue: Moving Processing to the Data
But how can all the distributed data be integrated without copying it first to a centralized data store, such as a data warehouse? Data virtualization technology6
offers a solution. In a nutshell, data virtualization makes a heterogeneous set of data sources look like one logical database to the users and applications. These data sources don't have to be stored locally; they can be anywhere.
Data virtualization technology is designed and optimized to integrate data live. There is no need to physically store all the integrated data centrally. Only when data from several different sources is requested by users is it integrated, not before. In other words, data virtualization supports integration on demand.
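Integration on demand can be sketched in a few lines. In this toy example (all names are illustrative), two independent in-memory SQLite databases play the role of remote data sources, and a "virtual view" is simply a function that joins them at the moment it is called; nothing is copied or stored in advance:

```python
import sqlite3

# Two independent "data sources": a CRM system and an ERP system.
def make_source(rows, ddl, insert):
    db = sqlite3.connect(":memory:")
    db.execute(ddl)
    db.executemany(insert, rows)
    return db

crm = make_source([(1, "Alice"), (2, "Bob")],
                  "CREATE TABLE customers (id INTEGER, name TEXT)",
                  "INSERT INTO customers VALUES (?, ?)")
erp = make_source([(1, 250.0), (2, 99.5)],
                  "CREATE TABLE orders (customer_id INTEGER, total REAL)",
                  "INSERT INTO orders VALUES (?, ?)")

def customer_orders():
    """Virtual view: the join across sources runs only when requested."""
    names = dict(crm.execute("SELECT id, name FROM customers"))
    return [(names[cid], total)
            for cid, total in erp.execute("SELECT customer_id, total FROM orders")]

print(customer_orders())  # [('Alice', 250.0), ('Bob', 99.5)]
```

A real data virtualization server adds metadata, security, caching and optimization on top of this idea, but the essence is the same: the integrated result exists only for the duration of the request.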
Because data virtualization servers retrieve data from other systems, they must understand networks. They must know how to transmit data efficiently over the network to the server where the integration on demand takes place. For example, to minimize network traffic, mature data virtualization servers deploy so-called push-down techniques. If a user asks for a small portion of a table, only that portion of the data is extracted from the data source by the data virtualization server, not the entire table. The query is "pushed down" to the data source instead of requesting the entire table.
Push down allows a data virtualization server to move the processing to the data
instead of moving the data to the processing. In the latter case, all the data is transmitted to the data virtualization server that subsequently executes the request. Especially if big data sets are used, this approach would be slow because of the amount of network traffic involved. A preferred approach is to ship the query to the data source, and transmit only relevant data back to the data virtualization server.
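The difference between the two strategies can be made concrete. In this hypothetical sketch (all names are illustrative), a local SQLite database stands in for the remote source, and we count how many rows would cross the network under each strategy:

```python
import sqlite3

# A "remote" source holding 1000 sales rows, only 2 of which are for region EU.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE sales (region TEXT, amount REAL)")
source.executemany("INSERT INTO sales VALUES (?, ?)",
                   [("EU", 10.0), ("EU", 20.0)] + [("US", 5.0)] * 998)

def without_pushdown(region):
    """Move the data to the processing: fetch the whole table, filter locally."""
    rows = source.execute("SELECT region, amount FROM sales").fetchall()
    result = [r for r in rows if r[0] == region]
    return result, len(rows)          # rows transferred = entire table

def with_pushdown(region):
    """Move the processing to the data: ship the predicate to the source."""
    rows = source.execute(
        "SELECT region, amount FROM sales WHERE region = ?", (region,)).fetchall()
    return rows, len(rows)            # rows transferred = only matching rows

_, naive_cost = without_pushdown("EU")   # 1000 rows cross the network
_, pushed_cost = with_pushdown("EU")     # only 2 rows cross the network
print(naive_cost, pushed_cost)
```

Both strategies return the same answer; push-down simply sends the query to the data instead of the data to the query, cutting the simulated network traffic from 1000 rows to 2.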
The Need for Distributed Data Virtualization: Moving Processing Closer to the Data
Moving processing to the data is a powerful feature for optimizing network traffic, but it's not sufficient for the distributed data world of tomorrow. Imagine that a data virtualization server runs on one central server: all requests for data are first sent to that server, queries are sent from there to all the data sources, answers are transmitted back, and all the data is integrated and returned to the users. This centralized processing of requests can be highly inefficient. It would be like a worldwide parcel service where all parcels are first shipped to Denver, and from there to the destination address. If a specific parcel has to be shipped from New York to San Francisco, this is not a bad solution. However, a parcel from New York to Boston is going to take an unnecessarily long time because of the detour via Denver. And what about a parcel that must be shipped from Berlin, Germany, to London, UK? That parcel is going to make a long journey via Denver before it arrives in London.
Besides this inefficiency, it's not recommended to have only one data virtualization server, because it lowers availability. If that server crashes, no one can get to the data anymore. It would be like the parcel service when the airport in Denver is closed because of bad weather.
To address the new data integration workload, itís important that data virtualization servers support a highly distributed architecture. Each node in the network where queries originate and data sources reside should run a version of the data virtualization server for processing these requests. Each node of the data virtualization server that receives user requests should know where the requested data resides, and must push the request to the relevant data virtualization server. Multiple data virtualization servers work together to execute the request. The effect is that when no remote data is requested, no shipping of data and requests will take place.
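The routing behavior described above can be sketched with a toy model (all names and the catalog mechanism are hypothetical). Each node holds some tables locally and shares a catalog mapping each table to the node that hosts it; a request is answered locally when possible and forwarded otherwise, so purely local requests cause no network hops:

```python
class DVNode:
    """One node of a distributed data virtualization server."""

    def __init__(self, name, local_tables, catalog):
        self.name = name
        self.local_tables = local_tables   # {table_name: rows}
        self.catalog = catalog             # shared {table_name: DVNode} registry
        for table in local_tables:
            catalog[table] = self

    def query(self, table, hops=0):
        """Return (rows, network_hops) for a request arriving at this node."""
        if table in self.local_tables:
            return self.local_tables[table], hops   # answered locally
        owner = self.catalog[table]
        return owner.query(table, hops + 1)         # push request to the data

catalog = {}
berlin = DVNode("berlin", {"eu_sales": [("EU", 10)]}, catalog)
denver = DVNode("denver", {"us_sales": [("US", 5)]}, catalog)

_, hops = berlin.query("eu_sales")   # local data: 0 network hops
print(hops)                          # 0
_, hops = berlin.query("us_sales")   # remote data: forwarded once
print(hops)                          # 1
```

A production-grade server would of course add cost-based routing, partial result shipping and failure handling, but the principle is the one the paragraph describes: requests travel to the data, and only when the data really is remote.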
This is only possible if a data virtualization server is knowledgeable about network aspects, such as the fastest network route, the cheapest network route, how to transmit data efficiently, the optimal packet size, and so on. Just as data virtualization servers must know how to optimize database access, they must also know how to optimize network traffic. It requires a close marriage of the network and data virtualization.
Note that this requirement to distribute data virtualization processing over countless nodes is not very different from the data processing architectures of NoSQL systems.
The Network is the Database
Data and data entry are more and more distributed over the network, and over time this will only increase. The time when all the data was stored together is forever gone. Sun Microsystems' tagline once was "The Network is the Computer." In this era, in which data is entered and stored everywhere, in which users who access the data can be everywhere, and in which big data systems are being developed, an analogous statement can be made:
The network is the database.
If the network is the database, copying all the data to one centralized node for integration purposes is expensive, technically almost undoable, and may clash with regulations. With its integration-on-demand approach, data virtualization technology offers a more suitable way to integrate all this widely dispersed data. Data virtualization will be a key instrument for integrating widely dispersed big data and turning "the network into a database." A requirement will be that data virtualization servers have a highly decentralized architecture and are extremely network-aware.
End Notes:
- Dave Edstrom, "The Network is the Computer," November 2012, see http://www.imts.com/show/newsletter/insider/article.cfm?aid=547
- Computer History Museum, A History of the Internet, see http://www.computerhistory.org/internet_history/internet_history_70s.html
- J.A. Larson and S. Rahimi, Tutorial: Distributed Database Management, IEEE Computer Society Press, 1985.
- M. Stonebraker, Readings in Database Systems, Morgan Kaufmann Publishers, 1988.
- Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data; see http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=CELEX:31995L0046:en:NOT
- R.F. van der Lans, Data Virtualization for Business Intelligence Systems, Morgan Kaufmann Publishers, 2012.