(Editor's note: This is the second in a multi-part series on big data.)
The big data movement is revolutionary. It seeks to overturn cherished tenets in the world of data processing. And it's succeeding.
Most people define "big data" by three attributes: volume, velocity, and variety. This is a good start, but misses what's really new. (See "The Vagaries of the Three V's: What's Really New About Big Data".) At its heart, big data is about liberation and balance. It's about liberating data from the iron grip of software vendors and IT departments. And it's about establishing balance within corporate analytical architectures long dominated by top-down data warehousing approaches. Let me explain.
Liberating Data from Vendors
Most people focus on the technology of big data and miss the larger picture. They think big data software is uniquely designed to process, report on, and analyze large data volumes. This is simply not true. You can do just about anything in a data warehouse that you can do in Hadoop; the major difference is cost and complexity for certain use cases.
For example, contrary to popular opinion on the big data circuit, many data warehouses can store and process unstructured and semi-structured content, execute analytical functions, and run processes in parallel across commodity servers. Back in the 1990s (and maybe even still today), 3M drove its Web site via its Teradata data warehouse, which dynamically delivered Web content stored as blobs or external files. Today, Intuit and other customers of text mining tools parse customer comments and other textual data into semantic objects that they store and query within a SQL database. Kelley Blue Book uses parallelized fuzzy matching algorithms baked into its Netezza data warehouse to parse, standardize, and match automobile transactions derived from auction houses and other data sources.
In fact, you can argue that Google and Yahoo could have used SQL databases to build their search indexes instead of creating Hadoop and MapReduce. However, they recognized that a SQL approach would have been wildly expensive because of database licensing costs due to the data volumes. They also realized that SQL isn't the most efficient way to parse URLs from millions of Web pages and such an approach would have jacked up engineering costs.
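To make the contrast concrete, here is a minimal sketch of the map/reduce pattern applied to the URL-parsing task described above. This is purely illustrative (the page data and function names are invented, and real Hadoop jobs run the map and reduce phases in parallel across a cluster), but it shows why the pattern fits the problem better than SQL: each page can be mapped independently, and the counts merge in a simple reduce step.

```python
import re
from collections import defaultdict

# Map phase: emit (url, 1) for every link found in a page's HTML.
# Illustrative sketch of the map/reduce pattern, not production code.
def map_urls(page_html):
    for url in re.findall(r'href="(https?://[^"]+)"', page_html):
        yield url, 1

# Shuffle + reduce phase: group the emitted pairs by URL and sum the counts.
def reduce_counts(mapped_pairs):
    counts = defaultdict(int)
    for url, n in mapped_pairs:
        counts[url] += n
    return dict(counts)

pages = [
    '<a href="http://example.com/a">A</a> <a href="http://example.com/b">B</a>',
    '<a href="http://example.com/a">A again</a>',
]
pairs = (pair for page in pages for pair in map_urls(page))
link_counts = reduce_counts(pairs)
# link_counts now holds how often each URL was linked across all pages.
```

In a real cluster, `map_urls` would run on millions of pages spread across many machines, with the framework handling the shuffle between the two phases.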
Open Source. The real weapon in the big data movement is not a technology or data processing framework; it's open source software. Rather than paying millions of dollars to Oracle or Teradata to store big data, companies can download Apache Hadoop and MapReduce for free, buy a bunch of commodity servers, and store and process all the data they want, without paying expensive software licenses and maintenance fees or forking over millions to upgrade these systems when they need additional capacity.
This doesn't mean that big data is free or doesn't carry substantial costs. You still have to buy and maintain hardware and hire hard-to-find data scientists and Hadoop administrators. But the bottom line is that it's no longer cost prohibitive to store and process hundreds of terabytes or even petabytes of data. With big data, companies can begin to tackle data projects they never thought possible.
Counterpoint. Of course, this threatens most traditional database vendors who rely on a steady revenue stream of large data projects. They are now working feverishly to surround and co-opt the big data movement. Most have established interfaces to move data from Hadoop to their systems, preferring to keep Hadoop as a nice staging area for raw data, but nothing else. Others are rolling out Hadoop appliances that keep Hadoop in a subservient role to the relational database. Still others are adopting open source tactics and offering scaled down or limited use versions of their databases for free, hoping to lure new buyers and retain existing ones.
Of course, this is all good news for consumers. We now have a wider range of offerings to choose from and will benefit from the downward price pressure exerted by the availability of open source data management and analytics software. Money ultimately drives all major revolutions, and the big data movement is no different.
Liberating Data From IT
The big data revolution not only seeks to liberate data from software vendors, it wants to free data from the control of the IT department. Too many business users, analysts, and developers have felt stymied by the long-arm of the IT department or data warehousing team. They now want to overthrow these alleged "high priests" of data who have put corporate data under lock and key for architectural or security reasons or who take forever to respond to requests for custom reports and extracts. The big data movement gives business users the keys to the data kingdom, especially highly skilled analysts and developers.
Load and Go. The secret weapon in the big data arsenal is something that Amr Awadallah, founder and CTO at Cloudera, calls "schema at read." Essentially, this means that with Hadoop you don't have to model or transform data before you query it. This cultivates a "load and go" environment where IT no longer stands between savvy analysts and the data. As long as analysts understand the structure and condition of the raw data and know how to write Java MapReduce code or use higher level languages like Pig or Hive, they can access and query the data without IT intervention (although they may need permission).
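The schema-at-read idea can be sketched in a few lines of Python (the field names and sensor data below are invented for illustration): the raw data is loaded as-is, and the schema and type casts are applied only at the moment an analyst queries it.

```python
import csv
import io

# Raw, unmodeled data sits in storage exactly as it arrived ("load and go").
raw = """2012-01-19,truck-7,engine_temp,94.2
2012-01-19,truck-9,engine_temp,101.5
2012-01-19,truck-7,oil_pressure,38.0"""

# The schema is the analyst's choice, imposed at read time, not at load time.
SCHEMA = ["date", "vehicle", "metric", "value"]

def read_with_schema(raw_text):
    for row in csv.reader(io.StringIO(raw_text)):
        record = dict(zip(SCHEMA, row))
        record["value"] = float(record["value"])  # type cast happens at read
        yield record

# Query the raw data directly, with no prior modeling or ETL step:
# find the hottest engine reading.
hottest = max(
    (r for r in read_with_schema(raw) if r["metric"] == "engine_temp"),
    key=lambda r: r["value"],
)
```

If tomorrow's question needs a different schema, the analyst simply reads the same raw bytes with a different set of field names; nothing has to be remodeled or reloaded.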
For John Rauser, principal engineer at Amazon.com, Hadoop is a godsend. He and his team are using Hadoop to rewrite many data-intensive applications that require multiple compute nodes. During his presentation at the Strata Conference in New York City this fall, Rauser touted Hadoop's ability to handle myriad applications, both transactional and analytical, with both small and large data volumes. His message was that Hadoop promotes agility. Basically, if you can write MapReduce programs, you can build anything you want quickly without having to wait for IT. With Hadoop, you can move as fast as or faster than the business.
This is a powerful message. And many data warehousing chieftains have tuned in. Tim Leonard, CTO of US Xpress, loves Hadoop's versatility. He has already put it into production to augment his real-time data warehousing environment, which captures truck engine sensor data and transforms it into various key performance indicators displayed on near real-time dashboards. Also, a BI director for a well-known Internet retailer uses Hadoop as a staging area for the data warehouse. He encourages his analysts to query Hadoop when they can't wait for data to be loaded into the warehouse or need access to detailed data in its raw, granular format.
Buyer Beware. To be fair, Hadoop today is a "buyer beware" environment. It is beyond the reach of ordinary business users, and even many power users. Today, Hadoop is agile only if you have lots of talented Java developers who understand data processing and an operations team with the expertise and time to manage Hadoop clusters. This steep expertise curve will diminish over time as the community refines higher level languages, like Hive and Pig, but even then, it's still pretty technical.
In contrast, a data warehouse is designed to meet the data needs of ordinary business users, although they may have to wait until the IT team finishes "baking the data" for general consumption. Unlike Hadoop, a data warehouse requires a data model that enforces data typing and referential integrity and ensures semantic consistency. It preprocesses data to detect and fix errors, standardize file formats, and aggregate and dimensionalize data to simplify access and optimize performance. Finally, data warehousing schemas present users with simple, business views of data culled from dozens of systems that are optimized for reporting and analysis. Hadoop simply doesn't do these things, nor should it.
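The warehouse's schema-on-write discipline can be sketched with SQLite (the dimension and fact table names are invented for illustration): the model is declared before any data loads, and the database rejects rows that would violate referential integrity rather than silently storing bad data.

```python
import sqlite3

# Schema-on-write sketch with illustrative star-schema table names:
# the model is defined up front, before any data arrives.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce referential integrity
conn.execute(
    "CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT)"
)
conn.execute(
    """CREATE TABLE fact_sales (
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER NOT NULL REFERENCES dim_product(product_id),
        amount     REAL
    )"""
)
conn.execute("INSERT INTO dim_product VALUES (1, 'widget')")
conn.execute("INSERT INTO fact_sales VALUES (100, 1, 9.99)")  # valid row

try:
    # This row references a product that doesn't exist; the schema
    # refuses it at load time instead of letting analysts find the
    # inconsistency later.
    conn.execute("INSERT INTO fact_sales VALUES (101, 42, 5.00)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

The cost of this discipline is exactly the IT "baking" step described above; the benefit is that every user who queries the warehouse sees the same consistent, validated view of the data.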
Clearly, big data and data warehousing environments are very different animals, each designed to meet different needs. They are the yin and yang of data processing. One delivers agility, the other stability. One unleashes creativity, the other preserves consistency. Thus, it's disconcerting to hear some in the big data movement dismiss data warehousing as an inferior and antiquated form of data processing, best preserved in a computer museum rather than used for genuine business operations.
Every organization needs both Hadoop and data warehousing. These two environments need to work in concert. And it's not just that Hadoop should serve as a staging area for the data warehouse. That's today's architecture. Hadoop will grow beyond this to become a full-fledged reporting and analytical environment as well as a data processing hub. It will become a rich sandbox for savvy business analysts (whom we now call data scientists) to mine mountains of data for million dollar insights and answer unanticipated or urgent questions that the data warehouse is not designed to handle.
Thomas Jefferson once said, "The tree of liberty must be refreshed from time to time with the blood of patriots and tyrants." He was referring to the natural process by which political and social structures stagnate and stratify. This principle holds true for many aspects of life, including data processing.
For too long, we've tried to shoehorn all analytical pursuits into a single top-down data delivery environment that we call a data warehouse. This framework is now creaking and groaning from the strain of carrying too much baggage. It's time we liberate the data warehouse to do what it does best, which is deliver consistent, non-volatile data to business users to answer predefined questions and populate key performance indicators within standard corporate and departmental dashboards and reports.
It's gratifying to see Hadoop come along and shake up data warehousing orthodoxy. The big data movement helps clarify the strengths and limitations of data warehousing and underscores its role within an analytics architecture. And this leaves Hadoop and NoSQL technologies to do what they do best, which is provide a cost-effective, agile development environment for processing and querying large volumes of unstructured data.
The organizations that figure out how to harmonize these environments will be the data champions of tomorrow.
Posted January 19, 2012 9:50 AM