Originally published November 19, 2008
Saturday June 7, 2008, was the 25th anniversary of DB2, and IBM was celebrating the anniversary at its recent Information on Demand (IOD) conference in Las Vegas in October. The anniversary was particularly poignant for me because it was DB2 that brought me to the United States in 1980 from the UK. My role back then was to work with the early customer DB2 test sites around the world. I left IBM in 1983, just as IBM DB2 was released, but it has been interesting to follow the progress of the product and its competitors over the past quarter century.
While at the IOD conference, I was surprised to discover new IBM projects using database techniques that were somewhat removed from traditional relational technology. I have noticed during recent product briefings that other vendors are also using alternatives to existing relational database technologies. This article explores why this trend is happening and discusses its likely impact on the future of relational systems.
Although relational database systems had a shaky start, their use has steadily grown over the years to dominate all aspects of IT processing from high-volume transaction applications to large-scale data warehouses. Today, relational products represent an $18.6 billion a year industry,1 and to quote Chris Date (who popularized Dr. E. F. Codd’s research work on the relational model), “[It’s] my opinion that the relational model is rock solid, and right, and will endure. I fully expect database systems still to be focused on Codd’s relational model.”2 I think most database specialists would agree with Chris – there has never been a sound reason to suppose that relational products would be superseded any time soon. Why then are we seeing other options appearing?
Let’s try to answer this question by reviewing the development of relational products over the past quarter century, and look at how relational technology is used in the present IT environment.
It is important to remember during this discussion that the relational model (in its simplest terms) is a logical model where data is stored in rows (tuples) and columns (attributes) of tables (relations), and manipulated by relational operators such as restriction, projection and join. The model does not define how the data is physical stored and managed on disk, or how the relational operators are implemented in a data manipulation language.
IBM DB2 and competing relational database products, such as Oracle Database and Microsoft SQL Server, have undergone significant development over the course of the last 25 years. Although all of these products support the relational model and implement relational operators in the SQL data manipulation language, their adherence to the relational model, and the dialect of SQL they provide, varies considerably. The SQL differences between products have become so significant as to make migrating between products extremely difficult.
The SQL Standard was supposed enable easy migration between relational database products by providing a common dialect of SQL. The standard, however, has evolved from offering a useful SQL subset, to one that requires considerable expertise to fully understand. The latest incarnation of the SQL standard (SQL:2008) is so big that it is unlikely any product will ever fully implement all of its capabilities. Many of its features are not relational at all. There are also several SQL extensions (such as SQL/XML) that further add to the scope of the language. The actual SQL features supported by products vary, and this coupled with numerous proprietary capabilities added by vendors means that the standard has less significance today than it did a few years ago. Adherence to the SQL standard has become almost meaningless.
Another difference between relational database products is how they physically store and manage data. Features such as bitmapped indexes, partitioned databases, columnar databases, parallel processing, in-memory databases, materialized views, improved optimization and so forth have enabled database vendors to distinguish their products from competitors. Many of these features are aimed at improving the performance for large-scale implementations.
A recent focus area in relational database development (and another area of incompatibility between products) has been support for more complex data types. Examples here include XML data, spatial data and multimedia content. Several products also allow developers to add their own data types.
About 80% of information in an organization is unstructured, and there is increasing interest in storing and analyzing this information in relational systems. This is often done by first converting the unstructured information into an XML format.
The major relational database systems have been providing XML support for several years. This has been done by mapping an XML document to a single large object data type, or by shredding the document into a relational schema. These solutions have inherent functional and performance constraints. A large object data type allows for fast insert and retrieval of full documents, but suffers from poor search and extract performance. Shredding XML to relational tables is expensive at insert time, but enables faster SQL queries and updates. The problem with shredding, however, is that XML documents can have many nested and repeating elements, and mapping from XML to a relational schema is a complicated task.
To solve problems with handling XML data, relational database products have starting adding support for XML-based storage engines and query languages. The latest release of IBM DB2, for example, supports a separate storage engine for XML data. The native XML storage is complemented with XML schema, index, and utility support. Data can be accessed using SQL/XML and XQuery.
SQL/XML enables document retrieval from the native XML storage. It also provides SQL applications with sub-document level search and extract capabilities by allowing XQuery or XPath functions in SQL statements. For a SQL programmer, SQL/XML is easy to learn because it involves additions to the existing SQL language.
XQuery is a XML query language defined by the World Wide Web Consortium (see www.w3.org/TR/xquery for more details). It is suited for queries that need to publish results as XML, to query XML stored inside or outside the database, or to span relational and XML sources. XQueries can optionally contain SQL statements to combine and correlate XML information with relational data
Relational products have experienced competition over the years. In the early days there was a brief skirmish with object database systems, but this quickly fizzled out, and relational systems prevailed.
Multidimensional database systems, however, have competed successfully with relational products for some time by supporting so-called online analytical processing, or OLAP. Oracle Essbase (acquired from Hyperion Software) is a good example here. Initially, these multidimensional products had their own proprietary OLAP manipulation languages, but this was often an inhibitor to their adoption. The move by products toward supporting the XML for Analysis (XMLA) mitigated this issue.
XMLA is a de facto industry standard that allows applications to talk to multidimensional data sources. The communication of messages back and forth is done using web standards (HTTP, SOAP and XML). The query language used is mdXML, which is based on Microsoft’s Multidimensional Expressions (MDX) language. While MDX is not an open standard, but rather a Microsoft owned specification, a wide range of OLAP vendors have adopted it.
MDX syntax appears, at first glance, to be similar to that of SQL. The principal difference between SQL and MDX is the ability of MDX to reference multiple dimensions. Although it is possible to use SQL exclusively to query cubes, MDX provides commands that are designed specifically to retrieve data as multidimensional data structures with any number of dimensions.
To compete with the multidimensional database products, relational database systems introduced cubing engines that extracted data from relational tables and presented the results to applications in a multidimensional format. IBM InfoSphere Warehouse Cubing Services, for example, is a multidimensional analysis server that enables applications to access data stored in DB2 tables using OLAP capabilities. It is based on technology IBM acquired from Alphablox, and provides cross-dimensional calculations, and time series and parallel period analyses. Dimensional navigation allows slice, dice, drill and pivot operations.
Operational business intelligence is a hot topic, and perhaps one aspect of this topic that affects the development of database processing is the need for analyzing data in real time. One way of doing this is to analyze data in flight or in motion, rather than data at rest, which is typically the case with data warehousing.
One of the first companies to support the analysis of in-flight data was StreamBase, a company found by Dr. Michael Stonebraker who is a key researcher in the field of database technology. Analysis in StreamBase is done using StreamSQL, which extends SQL to support the processing of real-time data streams. StreamSQL retains the capabilities of SQL while adding new features such as windowed relational operators, the ability to mix stored data with streaming data and the power to add custom logic, such as analytic functions.
Several other vendors now support real-time analysis, and there is an industry move toward establishing StreamSQL as a standard. Companies involved in this effort include Coral8, Oracle, StreamBase and Truviso. What makes stream computing valuable is the ability to create predictive information – the effect a particular hurricane may have on the trading of oil, for example.
IBM is also involved in real-time data analysis (including the effort to standardize StreamSQL). It recently announced several products in this area including InfoSphere Streams, which is based on the System S research project. To quote IBM, “System S provides an execution platform and services for user-developed applications that ingest, filter, analyze, and correlate potentially massive volumes of continuous data streams. It supports the composition of new applications in the form of stream processing graphs that can be created on the fly, mapped to a variety hardware configurations, and adapted as requests come and go, and relative priorities shift. System S is designed to scale from a single processing node, to high performance clusters of hundreds of processing nodes” (see domino.research.ibm.com/comm/research_projects.nsf/pages/esps.index.html for more details). It is interesting to note that the research project that evolved into DB2 was known as System R!
InfoSphere Streams does not support StreamSQL, but instead comes with a development language (and compiler) called SPADE (stream processing application declarative engine) that has been specifically developed for processing streams. SPADE comes with a toolkit of commonly used operators such as windowed relational operators and constructs to specify user-defined operators.
InfoSphere Streams is part of IBM’s business event processing (BEP) effort, which includes acquisitions such as AptSoft (software for complex event processing), Cognos Now! (operational BI technology), InfoDyne (analysis of low latency information), and Solid Information Technology (in-memory data caching and analysis).
The previous discussion highlights the direction of the industry toward supporting the processing and analysis of high volume information involving a wide range of different types of data. The amount of computing power required to do this can be considerable.
One area where these volumes and types of information can already be found is on the World Wide Web. Google, for example, handles the analysis of huge volumes of data on a daily basis. To support this processing and analysis of this volume of data, Google created a new software framework known as MapReduce, which supports parallel computations over large multiple petabyte files on clusters of computers (see en.wikipedia.org/wiki/MapReduce for more details). MapReduce divides applications into many small blocks of work. Google has implemented hundreds of MapReduce programs and, upwards of one thousand MapReduce jobs are executed on Google's clusters every day (see labs.google.com/papers/mapreduce.html for more details).
Several vendors have adapted this framework for supporting cloud computing, and MapReduce implementations have been written in C++, Java, Python and other languages. Google and IBM are teaming up to build large data centers to power a grid computing initiative based on MapReduce for university research and courses in distributed computer programming.
The joint IBM and Google project uses Hadoop, which is an open source distributed computing platform written in Java that implements MapReduce using the Hadoop Distributed File System (HDFS). HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located.
Hadoop has been demonstrated to run on clusters with several thousand nodes. Yahoo has been the largest contributor to the project and uses Hadoop extensively in its Web search and advertising businesses. Hadoop was created by Doug Cutting (now a Yahoo employee) who named it after his child's stuffed elephant (see en.wikipedia.org/wiki/Hadoop for more details).
Another IBM research vehicle involving MapReduce and Hadoop is Cloud9, which is a scalable data analytics and data mining platform for cloud computing. It uses a new high-level query language for JSON data known as Jaql. This is being used to study and improve the performance and optimization of MapReduce processing.
Another relational database company that is supporting MapReduce is Greenplum. Greenplum MapReduce enables programmers to run analytics against large-scale datasets stored in and outside of the Greenplum Database. It will execute both MapReduce and SQL directly within Greenplum’s parallel dataflow engine, which is at the heart of the Greenplum Database. To quote the company, “Greenplum MapReduce brings the benefits of a growing standard programming model to the reliability and familiarity of the relational database.”
Readers can see there are significant development efforts in the database industry for supporting analytical processing against huge volumes and complex information involving not only structured, but also unstructured data. Some of this work involves extending relational systems to support hybrid approaches, while other efforts offer alternatives to existing products. These efforts will not affect the developers of traditional operational and analytical applications; but for companies on the leading edge, the new technologies presented here are not those we have become familiar with over the past 25 years.
Recent articles by Colin White