(Editor's note: this is the first article in a multi-part series on big data.)
Most people define "big data" by three attributes: volume, velocity, and variety. These describe the main characteristics of big data, but aren't exclusive to it. Many data warehouses today exhibit these same characteristics. This article drills into these attributes and shows what data warehousing and big data environments have in common and where they differ.
Volume. As many observers have noted, data volume is a slippery term. What's large for some organizations is small for others. So, experts now define the term as "data that is no longer easy to manage." This is still pretty squishy as far as definitions go.
From a historical context, data warehousing has always been about "big data." The real difference is scale and scope, which have been growing steadily for years. In the 1990s, high-end data warehouses contained hundreds of gigabytes and then terabytes. Today, they have breached the petabyte range, and surely will ascend to exabytes sometime in the future.
Does that make a data warehouse a big data initiative? Not really. The big data movement today is largely about using open source data management software to cost-effectively capture, store, and process semi-structured Web log data for a variety of tasks. (See "Let the Revolution Begin: Big Data Liberation Theology.") While data warehousing is focused solely on delivering structured data for reporting and analysis applications, the big data movement has broader implications. Hadoop and NoSQL can manage any type of data (structured, semi-structured, and unstructured) for virtually any type of application (analytical or transactional).
Velocity. If you have big data, by default you have to load it in real time using streaming or mini-batch load intervals. Otherwise, you can never keep up. This is nothing new. Most data warehousing teams have already converted from weekly and nightly batch refreshes to mini-batch cycles of 15 minutes or less that insert only deltas using change data capture and trickle feeding techniques. Hadoop and NoSQL databases are also evolving from batch loading of data to streaming it in real time.
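To make the idea of "inserting only deltas" concrete, here is a minimal Python sketch of one mini-batch cycle. It is an illustration under stated assumptions, not how any particular warehouse does it: the field names (order_id, revenue, updated_at), the in-memory dictionary standing in for the warehouse, and both function names are hypothetical. It also detects changes by comparing timestamps against a high-water mark, which is a simplification; production change data capture tools commonly read the source database's transaction log instead.

```python
from datetime import datetime, timedelta


def extract_deltas(source_rows, last_loaded_at):
    """Return only the rows changed since the previous mini-batch."""
    return [row for row in source_rows if row["updated_at"] > last_loaded_at]


def run_mini_batch(source_rows, warehouse, last_loaded_at):
    """Upsert changed rows and return the new high-water mark."""
    deltas = extract_deltas(source_rows, last_loaded_at)
    for row in deltas:
        # Insert new keys and overwrite rows that changed at the source.
        warehouse[row["order_id"]] = row
    # Advance the mark only as far as the data actually loaded.
    return max((row["updated_at"] for row in deltas), default=last_loaded_at)


# Hypothetical source table and an empty in-memory "warehouse".
start = datetime(2012, 1, 19, 9, 0)
source = [
    {"order_id": 1, "revenue": 40.0, "updated_at": start},
    {"order_id": 2, "revenue": 25.0, "updated_at": start + timedelta(minutes=10)},
    {"order_id": 3, "revenue": 60.0, "updated_at": start + timedelta(minutes=20)},
]
warehouse = {}
mark = run_mini_batch(source, warehouse, start + timedelta(minutes=5))
```

Each cycle touches only the rows newer than the mark, so the cost of a refresh tracks the rate of change rather than the size of the table. Running the function again with the returned mark loads nothing until the source changes.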
Some organizations also embrace real-time loading to meet operational business needs. For example, 1-800 CONTACTS displays orders and revenues in a data warehouse-driven dashboard updated every 15 minutes. US Xpress tracks idle time of its trucks by capturing sensor data from truck engines fed into a data warehouse that drives several real-time dashboards. Currently, most big data installations don't support real-time reporting environments, but the technology is evolving fast and this capability will soon become standard fare.
Variety. Variety generally refers to the ability to capture, store, and process multiple types of data. This is perhaps the biggest differentiator between data warehousing and big data environments. Hadoop is agnostic about data type and format. Just dump your data into a file and then write a Java program to get it out. For example, a Hadoop cluster can store Twitter and Facebook data, audio and video, documents and transactions, and so on.
In addition, the same Hadoop file can contain a jumble of different records--or key-value pairs--each representing different entities or attributes. Although you can also do this in a columnar database, it's standard fare for Hadoop. (This is the "complexity" attribute that some industry observers add as a fourth attribute of big data.) Mixing record types puts the onus on the developer/analyst to sort through the records to find only the ones they want, which presumes foreknowledge about record types and identifiers. This would never fly in a data warehouse. The SQL would choke.
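The burden that mixed record types place on the developer can be sketched in a few lines of Python. This is plain Python standing in for a real Hadoop job, and everything specific in it is an assumption made for illustration: the one-JSON-record-per-line layout, the "type" field used as the record identifier, the sample data, and the select_records function.

```python
import json

# Hypothetical file contents: one record per line, several entity types mixed.
RAW_LINES = [
    '{"type": "tweet", "user": "alice", "text": "hello"}',
    '{"type": "order", "order_id": 17, "amount": 42.5}',
    'not valid json',
    '{"type": "tweet", "user": "bob", "text": "big data"}',
    '{"user": "carol"}',
]


def select_records(lines, wanted_type):
    """Yield parsed records whose 'type' field matches wanted_type."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        # Records without a 'type' field are skipped as well.
        if isinstance(record, dict) and record.get("type") == wanted_type:
            yield record


tweets = list(select_records(RAW_LINES, "tweet"))
```

Note that the filter works only because its author already knows that a "type" field exists and what values it takes. Nothing in the file enforces that knowledge, whereas a warehouse schema would.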
The three Vs provide a reasonable map to the big data landscape as long as you don't dig too deeply into the details. There, you'll find considerable overlap with traditional data warehousing techniques. The real difference between the two environments is that big data is better suited to handling a variety of data (i.e., unstructured and complex data) than a data warehouse, which is designed to work with standardized, non-volatile data.
Posted January 19, 2012 9:43 AM