The Network is the Database: Integrating Widely Dispersed Big Data with Data Virtualization
Originally published January 14, 2014
Introduction
Almost 30 years ago, in 1984, John Gage [1] of Sun Microsystems (acquired by Oracle in 2010) coined the phrase "The Network is the Computer." He was right then, and he is even more right today. Nowadays, application processing is highly distributed over countless machines connected by a network. The boundaries between computers have completely blurred. We run applications that seamlessly invoke application logic on other machines.
But it's not only application processing that is scattered across many computers; the same can be said for data. More and more digitized data is entered, collected and stored in a distributed fashion. It's stored in cloud applications, in outsourced ERP systems, on remote websites and so on. In addition, external data is available from governments, social media and news websites, and the number of valuable open data sources is staggering. The network is not only the computer anymore; the network has become the database as well.
This dispersion of data is a fact. Still, data has to be integrated to become valuable for an organization. For a long time, the traditional solution for data integration has been to copy the data to a centralized site such as the data warehouse. However, data volumes are increasing (and not only because of the popularity of big data systems). The consequence is that, more and more often, data has become too big to move for performance, latency or financial reasons; data has to stay where it's entered. For integration, instead of moving the data to the query processing (as in data warehouse systems), query processing must be moved to the data sources.
This article explains the problem of centralized consolidation of data and describes how data virtualization helps to turn the network into a database using on-demand integration. It also explains the importance of distributed data virtualization for operating efficiently in today's highly networked environment.
A Short History Lesson
Once upon a time, all the digitized data of an enterprise was stored on a small number of disks managed by a few machines, all standing in the same computer room. Specialists in white coats monitored these machines and were responsible for making backups of the valuable data. It's very likely that all the users were in the same building as well, accessing the data through monochrome monitors. The network that was used to move data between the machines was referred to as the sneakernet.
Then the time came when users started to roam the planet, and machines residing in different buildings were connected with real networks. Compared to today, these first generations of networks were just plain slow. For example, in the 1970s, Bob Metcalfe (co-inventor of Ethernet) built a high-speed network interface between MIT and ARPANET [2]. This network supported a dazzling network bandwidth of 100 Kbps. Compare that with today's 100 Gigabit Ethernet, which offers a million times more bandwidth. In an optimized network environment, one terabyte of data can now be transferred within 80 seconds. This would have taken 2.5 years in the 1970s.
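The figures above can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 Kbps = 10^3 bits per second) and an ideal link with no protocol overhead; the constant and function names are illustrative only.

```python
# Back-of-the-envelope check of the transfer times quoted in the text.
# Assumption: decimal units and an ideal link without overhead or latency.

TERABYTE_BITS = 8 * 10**12      # one terabyte expressed in bits
ARPANET_BPS = 100 * 10**3       # 100 Kbps link from the 1970s
ETHERNET_BPS = 100 * 10**9      # 100 Gigabit Ethernet
SECONDS_PER_YEAR = 365 * 24 * 3600


def transfer_seconds(bits: int, bits_per_second: int) -> float:
    """Ideal time to move `bits` over a link of the given bandwidth."""
    return bits / bits_per_second


speedup = ETHERNET_BPS // ARPANET_BPS
modern_seconds = transfer_seconds(TERABYTE_BITS, ETHERNET_BPS)
legacy_years = transfer_seconds(TERABYTE_BITS, ARPANET_BPS) / SECONDS_PER_YEAR

print(f"speedup: {speedup:,}x")                              # speedup: 1,000,000x
print(f"1 TB at 100 Gbps: {modern_seconds:.0f} seconds")     # 80 seconds
print(f"1 TB at 100 Kbps: {legacy_years:.1f} years")         # 2.5 years
```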
Because users were working on remote sites, accessing data involved transmitting data back and forth, and that was slow. The vendors of database servers tried to solve this problem by developing distributed database servers in the 1980s [3, 4]. By applying replication and partitioning techniques, data was moved closer to the users to minimize network delay. With replication, data is copied to the nodes on the network where users are requesting data. To keep replicas up to date, distributed database servers support complex and innovative replication mechanisms.
Nowadays, the computer room is no longer the place where new data is entered. Data is entered, collected and stored everywhere. Examples include:
Distributed collection: Websites running in the cloud collect millions of weblog records indicating visitor behavior. Factories operating worldwide run high-tech machines generating massive amounts of sensor data. Mobile devices collect data on application usage and track geographical locations.
But it's not only that data is stored in a distributed fashion; data entry is distributed as well. Employees, customers and suppliers all enter data via the Internet, using their own machines at home, on their mobile devices and so on. Data entry has never been more dispersed.
To summarize this short history lesson, in the beginning data and users were centralized. Next, data stayed centralized, and users became distributed. Now data and users are both highly distributed.
The Need to Integrate Distributed Data Remains
As described, there are many good reasons why data entry and data storage are dispersed. Still, data has to be integrated, and there are many different reasons for doing so:
Is Centralization the Answer to Data Integration?
For the last twenty years, the most popular solution for integrating data has been the data warehouse. In most data warehouse systems, data from multiple sources is physically moved to and consolidated in one big database (one site). Here, the data is integrated, standardized and cleansed, and made available for reporting and analytics.
This centralization and consolidation of data makes a lot of sense from the perspective of the need to integrate data. And if there is not too much data, it's technically feasible. But can we keep doing this? Can we keep moving and copying data, especially in this era of big data? It looks as if the answer is going to be no, and for some organizations it's already a no. Here we list four problems of this approach:
Data Virtualization to the Rescue: Moving Processing to the Data
But how can all the distributed data be integrated without copying it first to a centralized data store, such as a data warehouse? Data virtualization technology [6] offers a solution. In a nutshell, data virtualization makes a heterogeneous set of data sources look like one logical database to the users and applications. These data sources don't have to be stored locally; they can be anywhere.
Data virtualization technology is designed and optimized to integrate data live. There is no need to physically store all the integrated data centrally. Data from several different sources is integrated only when users request it, not before. In other words, data virtualization supports integration on demand.
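As a rough illustration of integration on demand, the sketch below joins two heterogeneous sources at query time without storing the combined result. Everything in it is an assumption made for the example: the in-memory SQLite database stands in for a remote relational source, the CSV string stands in for an export from a cloud application, and `customer_revenue` is a hypothetical virtual view. It is not how any particular data virtualization product is implemented.

```python
import csv
import io
import sqlite3

# Hypothetical source 1: a relational database holding customers.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])

# Hypothetical source 2: a CSV export of orders from another system.
ORDERS_CSV = "customer_id,amount\n1,250\n1,100\n2,75\n"


def customer_revenue():
    """Virtual view: reads both sources when called; nothing is persisted."""
    totals = {}
    for row in csv.DictReader(io.StringIO(ORDERS_CSV)):
        cid = int(row["customer_id"])
        totals[cid] = totals.get(cid, 0) + int(row["amount"])
    # Combine the aggregated orders with the customer names on the fly.
    for cid, name in crm.execute("SELECT id, name FROM customers ORDER BY id"):
        yield {"customer": name, "revenue": totals.get(cid, 0)}


result = list(customer_revenue())
print(result)
```

To the caller, `customer_revenue()` behaves like a single table, even though the data lives in two different formats and is only combined at the moment it is requested.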
Because data virtualization servers retrieve data from other systems, they must understand networks. They must know how to efficiently transmit data over the network to the server where the integration on demand takes place. For example, to minimize network traffic, mature data virtualization servers deploy so-called push-down techniques. If a user asks for a small portion of a table, only that portion of the data is extracted by the data virtualization server from the data source and not the entire table. The query is "pushed down" to the data source instead of requesting the entire table.
Push-down allows a data virtualization server to move the processing to the data instead of moving the data to the processing. In the latter case, all the data is transmitted to the data virtualization server, which subsequently executes the request. Especially if big data sets are used, this approach would be slow because of the amount of network traffic involved. The preferred approach is to ship the query to the data source and transmit only the relevant data back to the data virtualization server.
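The difference between the two approaches can be made concrete with a small sketch. It assumes an in-memory SQLite database as a stand-in for a remote data source and counts returned rows as a proxy for network traffic; the table `sensor_readings` and both function names are hypothetical.

```python
import sqlite3

# Stand-in for a remote data source; in practice it would sit across a network.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE sensor_readings (machine_id INTEGER, temperature REAL)")
source.executemany(
    "INSERT INTO sensor_readings VALUES (?, ?)",
    [(i % 100, 20.0 + (i % 50)) for i in range(10_000)],
)


def without_push_down(machine_id):
    """Fetch the whole table, then filter on the virtualization server."""
    rows = source.execute(
        "SELECT machine_id, temperature FROM sensor_readings"
    ).fetchall()
    wanted = [r for r in rows if r[0] == machine_id]
    return wanted, len(rows)      # len(rows) = rows sent over the "network"


def with_push_down(machine_id):
    """Ship the filter to the source; only matching rows come back."""
    rows = source.execute(
        "SELECT machine_id, temperature FROM sensor_readings WHERE machine_id = ?",
        (machine_id,),
    ).fetchall()
    return rows, len(rows)


slow_rows, slow_transferred = without_push_down(7)
fast_rows, fast_transferred = with_push_down(7)
print(slow_transferred, fast_transferred)   # 10000 100
```

Both functions return the same 100 rows, but without push-down all 10,000 rows cross the network first. The gap widens as the table grows and the requested portion stays small.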
The Need for Distributed Data Virtualization: Moving Processing Closer to the Data
Moving processing to the data is a powerful feature to optimize network traffic, but it's not sufficient for the distributed data world of tomorrow. Imagine that a data virtualization server runs on one server and all the requests for data are first moved to that central server, queries are sent to all the data sources, answers are transmitted back, and all the data is integrated and returned to all the users. This centralized processing of requests can be highly inefficient. It would be like a parcel service operating worldwide where all the parcels are first shipped to Denver, and from there to the destination address. If a specific parcel has to be shipped from New York to San Francisco, then this is not a bad solution. However, a parcel from New York to Boston is going to take an unnecessarily long time because of this detour via Denver. Or what about a parcel that must be shipped from Berlin, Germany, to London, UK? That parcel is going to make a long journey via Denver before it arrives in London.
Besides this inefficiency aspect, it's not recommended to have one data virtualization server because it lowers availability. If that server crashes, no one can get to the data anymore. It would be like the parcel service in a situation where the airport in Denver is closed because of bad weather conditions.
To address the new data integration workload, it's important that data virtualization servers support a highly distributed architecture. Each node in the network where queries originate and data sources reside should run an instance of the data virtualization server for processing these requests. Each instance that receives user requests should know where the requested data resides and must push the request to the relevant data virtualization server. Multiple data virtualization servers work together to execute the request. The effect is that when no remote data is requested, no shipping of data and requests will take place.
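A minimal sketch of this routing idea follows. It assumes a simplified model in which every node holds a shared catalog that maps each table to the node owning it; the `Node` class, `build_cluster` helper, table names and city names are all hypothetical and only illustrate the principle, not a real product architecture.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One data virtualization server instance (hypothetical model)."""
    name: str
    local_tables: dict                              # table name -> list of rows
    catalog: dict = field(default_factory=dict)     # table name -> owning Node
    remote_requests: int = 0

    def query(self, table, predicate):
        owner = self.catalog[table]
        if owner is self:
            # Data is local: no request or data leaves this node.
            return [row for row in self.local_tables[table] if predicate(row)]
        # Data is remote: push the request to the owning node, which filters
        # locally and ships back only the matching rows.
        self.remote_requests += 1
        return owner.query(table, predicate)


def build_cluster(nodes):
    """Give every node the same catalog of where each table resides."""
    catalog = {table: node for node in nodes for table in node.local_tables}
    for node in nodes:
        node.catalog = catalog


berlin = Node("berlin", {"eu_orders": [{"id": 1, "total": 40}, {"id": 2, "total": 90}]})
denver = Node("denver", {"us_orders": [{"id": 3, "total": 60}]})
build_cluster([berlin, denver])

local = berlin.query("eu_orders", lambda r: r["total"] > 50)
remote = berlin.query("us_orders", lambda r: r["total"] > 50)
print(local, remote, berlin.remote_requests)
```

The local query is answered entirely in Berlin, so `remote_requests` only increases for the query that touches data owned by the Denver node.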
This is only possible if a data virtualization server is knowledgeable about network aspects, such as the fastest network route, the cheapest network route, how to transmit data efficiently, the optimal packet size, and so on. Just as these servers must know how to optimize database access, they must also know how to optimize network traffic. It requires a close marriage of the network and data virtualization.
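To show what choosing between the fastest and the cheapest route could look like, the sketch below runs Dijkstra's shortest-path algorithm over a tiny network twice, once per metric. The link latencies and prices are made-up values, and `LINKS` and `best_route` are hypothetical names; a real server would have to obtain such figures from network monitoring.

```python
import heapq

# Hypothetical links: (latency in ms, cost per GB in dollars). Values are made up.
LINKS = {
    ("berlin", "london"): (15, 0.09),
    ("berlin", "denver"): (120, 0.02),
    ("denver", "london"): (110, 0.02),
}


def best_route(start, goal, metric):
    """Dijkstra's shortest path; metric 0 = latency, metric 1 = cost."""
    graph = {}
    for (a, b), weights in LINKS.items():
        # Links are treated as bidirectional.
        graph.setdefault(a, []).append((b, weights[metric]))
        graph.setdefault(b, []).append((a, weights[metric]))
    queue = [(0, start, [start])]
    seen = set()
    while queue:
        total, node, path = heapq.heappop(queue)
        if node == goal:
            return path, total
        if node in seen:
            continue
        seen.add(node)
        for neighbor, weight in graph.get(node, []):
            if neighbor not in seen:
                heapq.heappush(queue, (total + weight, neighbor, path + [neighbor]))
    return None, float("inf")


fastest_path, fastest_ms = best_route("berlin", "london", metric=0)
cheapest_path, cheapest_cost = best_route("berlin", "london", metric=1)
print(fastest_path, fastest_ms)
print(cheapest_path, round(cheapest_cost, 2))
```

With these invented numbers the fastest route is the direct Berlin to London link, while the cheapest route detours via Denver. The same request can therefore be routed differently depending on whether latency or cost is being optimized.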
Note that this requirement to distribute data virtualization processing over countless nodes is not very different from the data processing architectures of NoSQL systems.
The Network is the Database
Data and data entry are more and more distributed over the network, and over time this will only increase. The time when all the data was stored together is gone forever. Sun Microsystems' tagline once was "The Network is the Computer." In this era, in which data is entered and stored everywhere, in which users who access the data can be everywhere, and in which big data systems are being developed, an analogous statement can be made:
The network is the database.
If the network is the database, copying all the data to one centralized node for integration purposes is expensive, technically almost infeasible, and may clash with regulations. With its integration-on-demand approach, data virtualization technology offers a more suitable way to integrate all this widely dispersed data. Data virtualization will be the key instrument for integrating widely dispersed big data and turning "the network into a database." A requirement will be that data virtualization servers have a highly decentralized architecture and are extremely network-aware.
Recent articles by Rick van der Lans
Copyright 2004 — 2014. Powell Media, LLC. All rights reserved.
BeyeNETWORK™ is a trademark of Powell Media, LLC