Originally published September 23, 2010
Imagine a geologist who visits a volcano in Hawaii twice a year to check for troubling indicators of an impending eruption. Over the years, this scientist would collect quite a heap of data on the volcano. But what would happen if, instead of sending the scientist out twice a year, sensors were installed to measure things like temperature and vibration every five minutes? The volume of data collected would grow more than 50,000-fold.
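A quick back-of-envelope sketch of the two sampling regimes (the twice-a-year and five-minute intervals are the ones assumed above):

```python
# Back-of-envelope comparison of manual vs. automated sampling rates.
# All figures are illustrative assumptions from the scenario above.

MINUTES_PER_YEAR = 60 * 24 * 365              # 525,600 minutes in a year

manual_readings_per_year = 2                   # scientist visits twice a year
sensor_readings_per_year = MINUTES_PER_YEAR // 5   # one reading every 5 minutes

growth_factor = sensor_readings_per_year / manual_readings_per_year
print(sensor_readings_per_year)   # 105120 readings per year
print(growth_factor)              # 52560.0-fold growth
```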
There are a lot of troubling statistics about growing data volumes in the enterprise and the cost of storage, but sensors are poised to add more data to the scene than any other technology or sector. The cost of microsensors is plummeting, intelligent devices are expected to grow nearly six-fold by 2013, and automated data collection of the physical world is heralding what some call “the Internet of things.” IBM would call it a “smarter planet.”
Whether it’s millions of RFID tags in Walmart’s supply chain, a network of sensors monitoring the nation’s water reserves or – over the next decade – sensors that control cars and traffic, huge volumes of streaming data on the real world will create interesting and powerful new applications of analytics, but will utterly cripple conventional relational database management systems (RDBMSs).
A large manufacturing plant could be collecting information such as temperature, pressure and humidity from hundreds of key points in the plant every minute, producing a fire hose of data far larger and faster than anything a human being could measure and record by hand. To detect anomalies or safety issues early, that data would need to be processed in near real time.
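A near-real-time check of that kind can be sketched with a rolling z-score over the incoming stream; the window size, threshold and readings below are illustrative assumptions, not recommendations:

```python
# Minimal sketch of near-real-time anomaly detection on a sensor stream,
# using a rolling z-score. Window and threshold are illustrative choices.
from collections import deque
from statistics import mean, stdev

def make_detector(window=60, threshold=3.0):
    """Return a function that flags readings far outside the recent window."""
    history = deque(maxlen=window)

    def check(reading):
        anomalous = False
        if len(history) >= 2:
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and abs(reading - mu) > threshold * sigma:
                anomalous = True
        history.append(reading)   # always fold the reading into the window
        return anomalous

    return check

# Hypothetical temperature feed: steady values, then a sudden spike.
detect = make_detector(window=10, threshold=3.0)
readings = [20.1, 20.3, 20.2, 20.0, 20.4, 20.2, 20.1, 95.0]
flags = [detect(r) for r in readings]
print(flags)   # only the final spike is flagged
```

Each reading is checked the moment it arrives, against only a small sliding window of recent history – no round trip to a database table is needed.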
Moreover, if sensors are creating a snapshot of data every minute, even a table for a single measurement would grow to more than 500,000 rows after a year, or more than five million rows in ten years. Without an index or other built-in intelligence, each query would be forced to scan every row.
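To make the scan-every-row point concrete, here is a minimal sketch using SQLite; the table, column names and data are my own invention for illustration:

```python
# Sketch of the full-scan problem on a year of per-minute sensor rows.
# Table, columns and values are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (ts INTEGER, temp REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    ((minute, 20.0 + minute % 5) for minute in range(525_600)),  # one year of minutes
)

query = "EXPLAIN QUERY PLAN SELECT temp FROM readings WHERE ts = 123456"

# Without an index, looking up one timestamp scans all 525,600 rows.
scan_plan = conn.execute(query).fetchone()[-1]
print(scan_plan)      # e.g. "SCAN readings"

# An index on the timestamp turns the same query into a cheap seek.
conn.execute("CREATE INDEX idx_ts ON readings (ts)")
indexed_plan = conn.execute(query).fetchone()[-1]
print(indexed_plan)   # e.g. "SEARCH readings USING INDEX idx_ts (ts=?)"
```

The index rescues the lookup, but it must also be maintained on every one of those per-minute inserts – part of why high-rate streaming workloads strain conventional designs.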
As pointed out by Dr. Michael Stonebraker in “Data Torrents and Rivers” (IEEE Spectrum, September 2006), the overhead accruing from the multiuser switching and persistent storage processes of conventional RDBMSs is prohibitive and “A new class of system software … is required.” However, we’ve demonstrated that the overhead of conventional RDBMSs is not an inherent consequence of the relational data model, but only of their historical design limitations.
Unfortunately, despite heavy maintenance burdens, lagging performance and incompatibility with a growing portion of data that is unstructured, we insist on using conventional RDBMSs for advanced analytics when it’s abundantly clear they will never be able to handle the real-time data feeds of the future.
Despite the shortcomings of the current model, the majority of IT executives have great difficulty imagining a corporate database that’s not managed by a conventional RDBMS. Even as we hire a growing staff to manage the data and spend millions on hardware and software, we remain latched on to 30-year-old conventional RDBMS designs for applications far beyond their capabilities.
I believe the proliferation of sensors may be the straw that finally breaks the camel’s back. The volume, speed and accessibility requirements will be so high that an army of thousands of data managers could not even make a dent in structuring a database housing street-level information from the “smart cities” of an entire state. No conventional RDBMS will be able to process it at any price. We will have to learn a new trick – and it’s about time.