Blog: Wayne Eckerson

Welcome to Wayne's World, my blog that illuminates the latest thinking about how to deliver insights from business data and celebrates out-of-the-box thinkers and doers in the business intelligence (BI), performance management and data warehousing (DW) fields. Tune in here if you want to keep abreast of the latest trends, techniques, and technologies in this dynamic industry.

About the author

Wayne has been a thought leader in the business intelligence field since the early 1990s. He has conducted numerous research studies and is a noted speaker, blogger, and consultant. He is the author of two widely read books: Performance Dashboards: Measuring, Monitoring, and Managing Your Business (2005, 2010) and The Secrets of Analytical Leaders: Insights from Information Insiders (2012).

Wayne is currently director of BI Leadership Research, an education and research service run by TechTarget that provides objective, vendor neutral content to business intelligence (BI) professionals worldwide. Wayne’s consulting company, BI Leader Consulting, provides strategic planning, architectural reviews, internal workshops, and long-term mentoring to both user and vendor organizations. For many years, Wayne served as director of education and research at The Data Warehousing Institute (TDWI) where he oversaw the company’s content and training programs and chaired its BI Executive Summit. He can be reached by email at weckerson@techtarget.com.

(Editor's note: This is the second in a multi-part series on big data.)

The big data movement is revolutionary. It seeks to overturn cherished tenets in the world of data processing. And it's succeeding.

Most people define "big data" by three attributes: volume, velocity, and variety. This is a good start, but misses what's really new. (See "The Vagaries of the Three V's: What's Really New About Big Data".) At its heart, big data is about liberation and balance. It's about liberating data from the iron grip of software vendors and IT departments. And it's about establishing balance within corporate analytical architectures long dominated by top-down data warehousing approaches. Let me explain.

Liberating Data from Vendors

Most people focus on the technology of big data and miss the larger picture. They think big data software is uniquely designed to process, report on, and analyze large data volumes. This is simply not true. You can do just about anything in a data warehouse that you can do in Hadoop; the major differences are cost and complexity for certain use cases.

For example, contrary to popular opinion on the big data circuit, many data warehouses can store and process unstructured and semi-structured content, execute analytical functions, and run processes in parallel across commodity servers. Back in the 1990s (and perhaps still today), 3M drove its Web site from its Teradata data warehouse, which dynamically delivered Web content stored as blobs or external files. Today, Intuit and other customers of text mining tools parse customer comments and other textual data into semantic objects that they store and query within a SQL database. Kelley Blue Book uses parallelized fuzzy matching algorithms baked into its Netezza data warehouse to parse, standardize, and match automobile transactions derived from auction houses and other data sources.
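
As a rough illustration of the text mining example, here is a minimal, hypothetical JDBC sketch, not drawn from any of the vendors above, showing how a SQL database can hold raw comment text alongside an attribute parsed out by a text mining step, and then query both with ordinary SQL. The connection string, table, and column names are assumptions for illustration only.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Hypothetical sketch: store a raw customer comment (a large text/CLOB column)
// next to an attribute produced by a text-mining tool, then query both with SQL.
// The connection string, table, and column names are illustrative only.
public class CommentStoreSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:your_warehouse://host/db", "user", "password")) {

            // Insert the raw text plus a parsed "sentiment" attribute.
            String insert = "INSERT INTO customer_comments "
                          + "(comment_id, raw_text, sentiment) VALUES (?, ?, ?)";
            try (PreparedStatement ps = conn.prepareStatement(insert)) {
                ps.setLong(1, 42L);
                ps.setString(2, "The new dashboard is great, but exports are slow.");
                ps.setString(3, "mixed");   // produced upstream by a text-mining step
                ps.executeUpdate();
            }

            // Query the parsed attribute with ordinary SQL; the raw text rides along.
            String query = "SELECT comment_id, raw_text FROM customer_comments "
                         + "WHERE sentiment = ?";
            try (PreparedStatement ps = conn.prepareStatement(query)) {
                ps.setString(1, "mixed");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getLong(1) + ": " + rs.getString(2));
                    }
                }
            }
        }
    }
}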

In fact, you can argue that Google and Yahoo could have used SQL databases to build their search indexes instead of creating Hadoop and MapReduce. However, they recognized that a SQL approach would have been wildly expensive, given database licensing costs at those data volumes. They also realized that SQL isn't the most efficient way to parse URLs from millions of Web pages, and that such an approach would have jacked up engineering costs.

Open Source. The real weapon in the big data movement is not a technology or data processing framework; it's open source software. Rather than paying millions of dollars to Oracle or Teradata to store big data, companies can download Apache Hadoop and MapReduce for free, buy a bunch of commodity servers, and store and process all the data they want without paying expensive software licenses and maintenance fees or forking over millions to upgrade these systems when they need additional capacity.

This doesn't mean that big data is free or doesn't carry substantial costs. You still have to buy and maintain hardware and hire hard-to-find data scientists and Hadoop administrators. But the bottom line is that it's no longer cost prohibitive to store and process hundreds of terabytes or even petabytes of data. With big data, companies can begin to tackle data projects they never thought possible.

Counterpoint. Of course, this threatens most traditional database vendors, who rely on a steady revenue stream of large data projects. They are now working feverishly to surround and co-opt the big data movement. Most have established interfaces to move data from Hadoop to their systems, preferring to keep Hadoop as a nice staging area for raw data, but nothing more. Others are rolling out Hadoop appliances that keep Hadoop in a role subservient to the relational database. Still others are adopting open source tactics and offering scaled-down or limited-use versions of their databases for free, hoping to lure new buyers and retain existing ones.

Of course, this is all good news for consumers. We now have a wider range of offerings to choose from and will benefit from the downward price pressure exerted by the availability of open source data management and analytics software. Money ultimately drives all major revolutions, and the big data movement is no different.

Liberating Data From IT

The big data revolution not only seeks to liberate data from software vendors, it also wants to free data from the control of the IT department. Too many business users, analysts, and developers have felt stymied by the long arm of the IT department or data warehousing team. They now want to overthrow these alleged "high priests" of data, who have put corporate data under lock and key for architectural or security reasons or who take forever to respond to requests for custom reports and extracts. The big data movement gives business users, especially highly skilled analysts and developers, the keys to the data kingdom.

Load and Go. The secret weapon in the big data arsenal is something that Amr Awadallah, founder and CTO at Cloudera, calls "schema at read." Essentially, this means that with Hadoop you don't have to model or transform data before you query it. This creates a "load and go" environment in which IT no longer stands between savvy analysts and the data. As long as analysts understand the structure and condition of the raw data and know how to write Java MapReduce code or use higher-level languages like Pig or Hive, they can access and query the data without IT intervention (although they may need permission).
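
To make "load and go" concrete, below is a minimal, hypothetical Java MapReduce job that applies structure only at read time: raw web-log lines sit in HDFS exactly as they were loaded, and the mapper parses them on the fly to count hits per URL. The input path, log layout, and field position are assumptions for illustration, not details from Cloudera.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Schema-at-read sketch: raw log lines are loaded into HDFS as-is, and the
// mapper imposes structure (space-delimited fields, URL assumed in field 7)
// only when the data is read. Paths and field positions are illustrative.
public class UrlHitCount {

    public static class ParseMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text url = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(" ");
            if (fields.length > 6) {      // quietly skip malformed lines
                url.set(fields[6]);       // requested URL in this assumed log layout
                context.write(url, ONE);
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            context.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "url hit count");
        job.setJarByClass(UrlHitCount.class);
        job.setMapperClass(ParseMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path("/raw/weblogs"));   // loaded as-is
        FileOutputFormat.setOutputPath(job, new Path("/out/url_hits"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A Hive or Pig script could express the same logic more concisely for analysts who prefer not to write Java.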

For John Rauser, principal engineer at Amazon.com, Hadoop is a godsend. He and his team are using Hadoop to rewrite many data-intensive applications that require multiple compute nodes. During his presentation at the Strata Conference in New York City this fall, Rauser touted Hadoop's ability to handle myriad applications, both transactional and analytical, with both small and large data volumes. His message was that Hadoop promotes agility. Basically, if you can write MapReduce programs, you can build anything you want quickly without having to wait for IT. With Hadoop, you can move as fast as or faster than the business.

This is a powerful message. And many data warehousing chieftains have tuned in. Tim Leonard, CTO of US Xpress, loves Hadoop's versatility. He has already put it into production to augment his real-time data warehousing environment, which captures truck engine sensor data and transforms it into key performance indicators displayed on near real-time dashboards. Likewise, a BI director for a well-known Internet retailer uses Hadoop as a staging area for the data warehouse. He encourages his analysts to query Hadoop when they can't wait for data to be loaded into the warehouse or need access to detailed data in its raw, granular format.

Buyer Beware. To be fair, Hadoop today is a "buyer beware" environment. It is beyond the reach of ordinary business users, and even many power users. Today, Hadoop is agile only if you have lots of talented Java developers who understand data processing and an operations team with the expertise and time to manage Hadoop clusters. This steep expertise curve will flatten over time as the community refines higher-level languages like Hive and Pig, but even then, the environment will remain fairly technical.

In contrast, a data warehouse is designed to meet the data needs of ordinary business users, although they may have to wait until the IT team finishes "baking the data" for general consumption. Unlike Hadoop, a data warehouse requires a data model that enforces data typing and referential integrity and ensures semantic consistency. It preprocesses data to detect and fix errors, standardize file formats, and aggregate and dimensionalize data to simplify access and optimize performance. Finally, data warehousing schemas present users with simple business views of data culled from dozens of systems, optimized for reporting and analysis. Hadoop simply doesn't do these things, nor should it.
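
To make the contrast concrete, here is a small, hypothetical sketch of the kind of schema a warehouse defines up front, again expressed in Java via JDBC for consistency with the earlier examples. The tables, columns, and connection string are illustrative assumptions; the point is simply that types and foreign-key relationships are declared before any data is loaded.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Hypothetical star-schema slice, created up front: the warehouse enforces
// data types and referential integrity before anyone queries it, in contrast
// to Hadoop's apply-structure-at-read approach. All names are illustrative.
public class WarehouseSchemaSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:your_warehouse://host/db", "user", "password");
             Statement stmt = conn.createStatement()) {

            stmt.executeUpdate(
                "CREATE TABLE dim_customer ("
              + "  customer_key INTEGER PRIMARY KEY,"
              + "  customer_name VARCHAR(100) NOT NULL,"
              + "  region VARCHAR(50) NOT NULL)");

            stmt.executeUpdate(
                "CREATE TABLE fact_sales ("
              + "  sale_id BIGINT PRIMARY KEY,"
              + "  customer_key INTEGER NOT NULL REFERENCES dim_customer(customer_key),"
              + "  sale_date DATE NOT NULL,"
              + "  amount DECIMAL(12,2) NOT NULL)");  // typed and validated before load
        }
    }
}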

Restoring Balance

Clearly, big data and data warehousing environments are very different animals, each designed to meet different needs. They are the yin and yang of data processing. One delivers agility, the other stability. One unleashes creativity, the other preserves consistency. Thus, it's disconcerting to hear some in the big data movement dismiss data warehousing as an inferior and antiquated form of data processing, best preserved in a computer museum rather than used for genuine business operations.

Every organization needs both Hadoop and data warehousing. The two environments need to work in concert. And it's not just that Hadoop should serve as a staging area for the data warehouse; that's today's architecture. Hadoop will grow beyond this to become a full-fledged reporting and analytical environment as well as a data processing hub. It will become a rich sandbox in which savvy business analysts (whom we now call data scientists) can mine mountains of data for million-dollar insights and answer unanticipated or urgent questions that the data warehouse is not designed to handle.

Summary

Thomas Jefferson once said, "The tree of liberty must be refreshed from time to time with the blood of patriots and tyrants." He was referring to the natural process by which political and social structures stagnate and stratify. This principle holds true for many aspects of life, including data processing.

For too long, we've tried to shoehorn all analytical pursuits into a single top-down data delivery environment that we call a data warehouse. This framework is now creaking and groaning from the strain of carrying too much baggage. It's time we liberate the data warehouse to do what it does best, which is deliver consistent, non-volatile data to business users to answer predefined questions and populate key performance indicators within standard corporate and departmental dashboards and reports.

It's gratifying to see Hadoop come along and shake up data warehousing orthodoxy. The big data movement helps clarify the strengths and limitations of data warehousing and underscores its role within an analytics architecture. And it leaves Hadoop and NoSQL technologies to do what they do best, which is provide a cost-effective, agile development environment for processing and querying large volumes of unstructured data.

The organizations that figure out how to harmonize these environments will be the data champions of tomorrow.


Posted January 19, 2012 9:50 AM

4 Comments

Hi Wayne,

You say here that "contrary to popular opinion on the big data circuit, many data warehouses can store and process unstructured and semi-structured content, execute analytical functions, and run processes in parallel across commodity servers" but in the first part, you cite "variety" as one of the 'V's and that "This would never fly in a data warehouse. The SQL would grok."

I understand that the software and engineering costs would be higher with the traditional SQL approach, but the above statements seem to be at odds over the feasibility of using SQL for big data. Can you clarify?

thanks,
Navneeth

Don't forget to mention the role of cloud computing and cloud storage... That's the other way big data is evolving. You don't even have to own your own servers.

Check out my prior blog on the three V's. Variety is a big differentiator. But just because you can do something in SQL doesn't mean you should. SQL databases can store unstructured data as blobs or as pointers to files outside the database. They can't manipulate that data, but they can store it and give users access to it. Even in Hadoop, semi-structured data needs to be given some structure before it can be queried and manipulated rather than just accessed and displayed. Hope that helps.

I agree wholeheartedly. One of the fastest growing implementations of Hadoop is Amazon's Elastic MapReduce. Plus they offer Amazon DynamoDB, a NoSQL database.
