Blog: Wayne Eckerson

Wayne Eckerson

Welcome to Wayne's World, my blog that illuminates the latest thinking about how to deliver insights from business data and celebrates out-of-the-box thinkers and doers in the business intelligence (BI), performance management and data warehousing (DW) fields. Tune in here if you want to keep abreast of the latest trends, techniques, and technologies in this dynamic industry.

About the author

Wayne has been a thought leader in the business intelligence field since the early 1990s. He has conducted numerous research studies and is a noted speaker, blogger, and consultant. He is the author of two widely read books: Performance Dashboards: Measuring, Monitoring, and Managing Your Business (2005, 2010) and The Secrets of Analytical Leaders: Insights from Information Insiders (2012).

Wayne is currently director of BI Leadership Research, an education and research service run by TechTarget that provides objective, vendor neutral content to business intelligence (BI) professionals worldwide. Wayne’s consulting company, BI Leader Consulting, provides strategic planning, architectural reviews, internal workshops, and long-term mentoring to both user and vendor organizations. For many years, Wayne served as director of education and research at The Data Warehousing Institute (TDWI) where he oversaw the company’s content and training programs and chaired its BI Executive Summit. He can be reached by email at weckerson@techtarget.com.

January 2012 Archives

(Editor's note: This is the second in a multi-part series on big data.)

The big data movement is revolutionary. It seeks to overturn cherished tenets in the world of data processing. And it's succeeding.

Most people define "big data" by three attributes: volume, velocity, and variety. This is a good start, but misses what's really new. (See "The Vagaries of the Three V's: What's Really New About Big Data".) At its heart, big data is about liberation and balance. It's about liberating data from the iron grip of software vendors and IT departments. And it's about establishing balance within corporate analytical architectures long dominated by top-down data warehousing approaches. Let me explain.

Liberating Data from Vendors

Most people focus on the technology of big data and miss the larger picture. They think big data software is uniquely designed to process, report on, and analyze large data volumes. This is simply not true. You can do just about anything in a data warehouse that you can do in Hadoop; the major difference is cost and complexity for certain use cases.

For example, contrary to popular opinion on the big data circuit, many data warehouses can store and process unstructured and semi-structured content, execute analytical functions, and run processes in parallel across commodity servers. For instance, back in the 1990s (and maybe even still today), 3M drove its Web site via its Teradata data warehouse, which dynamically delivered Web content stored as blobs or external files. Today, Intuit and other customers of text mining tools parse customer comments and other textual data into semantic objects that they store and query within a SQL database. Kelley Blue Book uses parallelized fuzzy matching algorithms baked into its Netezza data warehouse to parse, standardize, and match automobile transactions derived from auction houses and other data sources.

In fact, you can argue that Google and Yahoo could have used SQL databases to build their search indexes instead of creating Hadoop and MapReduce. However, they recognized that a SQL approach would have been wildly expensive at those data volumes because of database licensing costs. They also realized that SQL isn't the most efficient way to parse URLs from millions of Web pages, and that such an approach would have jacked up engineering costs.

Open Source. The real weapon in the big data movement is not a technology or data processing framework; it's open source software. Rather than paying millions of dollars to Oracle or Teradata to store big data, companies can download Apache Hadoop and MapReduce for free, buy a bunch of commodity servers, and store and process all the data they want without paying expensive software licenses and maintenance fees or forking over millions to upgrade these systems when they need additional capacity.

This doesn't mean that big data is free or doesn't carry substantial costs. You still have to buy and maintain hardware and hire hard-to-find data scientists and Hadoop administrators. But the bottom line is that it's no longer cost prohibitive to store and process hundreds of terabytes or even petabytes of data. With big data, companies can begin to tackle data projects they never thought possible.

Counterpoint. Of course, this threatens most traditional database vendors who rely on a steady revenue stream of large data projects. They are now working feverishly to surround and co-opt the big data movement. Most have established interfaces to move data from Hadoop to their systems, preferring to keep Hadoop as a nice staging area for raw data, but nothing else. Others are rolling out Hadoop appliances that keep Hadoop in a subservient role to the relational database. Still others are adopting open source tactics and offering scaled-down or limited-use versions of their databases for free, hoping to lure new buyers and retain existing ones.

Of course, this is all good news for consumers. We now have a wider range of offerings to choose from and will benefit from the downward price pressure exerted by the availability of open source data management and analytics software. Money ultimately drives all major revolutions, and the big data movement is no different.

Liberating Data From IT

The big data revolution not only seeks to liberate data from software vendors, it also wants to free data from the control of the IT department. Too many business users, analysts, and developers have felt stymied by the long arm of the IT department or data warehousing team. They now want to overthrow these alleged "high priests" of data who have put corporate data under lock and key for architectural or security reasons, or who take forever to respond to requests for custom reports and extracts. The big data movement gives business users, especially highly skilled analysts and developers, the keys to the data kingdom.

Load and Go. The secret weapon in the big data arsenal is something that Amr Awadallah, founder and CTO at Cloudera, calls "schema on read." Essentially, this means that with Hadoop you don't have to model or transform data before you query it. This fosters a "load and go" environment where IT no longer stands between savvy analysts and the data. As long as analysts understand the structure and condition of the raw data and know how to write Java MapReduce code or use higher-level languages like Pig or Hive, they can access and query the data without IT intervention (although they may still need permission).
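
To make the schema-on-read idea concrete, here is a minimal sketch of a Hadoop mapper, written against the standard org.apache.hadoop.mapreduce API, that imposes structure on raw, untransformed Web log lines only at read time. The log layout, field positions, and class name are assumptions I've made up for illustration, not anyone's production code.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical mapper: the raw log lines were dumped into HDFS as-is, and the
    // "schema" (tab-delimited fields, requested URL in the third position) is
    // applied here, at read time, rather than in an upfront modeling or ETL step.
    public class PageViewMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text url = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length < 3) {
                return;                  // tolerate malformed lines instead of failing the job
            }
            url.set(fields[2]);          // assumed position of the requested URL
            context.write(url, ONE);     // emit (url, 1) for a downstream count
        }
    }

A reducer that sums the 1s per URL would complete the page-view count; the point is that no table definition or transformation job stood between the analyst and the raw file.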

For John Rauser, principal engineer at Amazon.com, Hadoop is a godsend. He and his team are using Hadoop to rewrite many data-intensive applications that require multiple compute nodes. During his presentation at the Strata Conference in New York City this fall, Rauser touted Hadoop's ability to handle myriad applications, both transactional and analytical, with both small and large data volumes. His message was that Hadoop promotes agility. Basically, if you can write MapReduce programs, you can build anything you want quickly without having to wait for IT. With Hadoop, you can move as fast as the business, or faster.

This is a powerful message. And many data warehousing chieftains have tuned in. Tim Leonard, CTO of US Xpress, loves Hadoop's versatility. He has already put it into production to augment his real-time data warehousing environment, which captures truck engine sensor data and transforms it into various key performance indicators displayed on near real-time dashboards. Also, a BI director for a well-known internet retailer uses Hadoop as a staging area for the data warehouse. He encourages his analysts to query Hadoop when they can't wait for data to be loaded into the warehouse or need access to detailed data in its raw, granular format.

Buyer Beware. To be fair, Hadoop today is a "buyer beware" environment. It is beyond the reach of ordinary business users, and even many power users. Today, Hadoop is agile only if you have lots of talented Java developers who understand data processing and an operations team with the expertise and time to manage Hadoop clusters. This steep learning curve will flatten over time as the community refines higher-level languages, like Hive and Pig, but even then, the environment will remain pretty technical.

In contrast, a data warehouse is designed to meet the data needs of ordinary business users, although they may have to wait until the IT team finishes "baking the data" for general consumption. Unlike Hadoop, a data warehouse requires a data model that enforces data typing and referential integrity and ensures semantic consistency. It preprocesses data to detect and fix errors, standardize file formats, and aggregate and dimensionalize data to simplify access and optimize performance. Finally, data warehousing schemas present users with simple business views, optimized for reporting and analysis, of data culled from dozens of systems. Hadoop simply doesn't do these things, nor should it.

Restoring Balance

Clearly, big data and data warehousing environments are very different animals, each designed to meet different needs. They are the yin and yang of data processing. One delivers agility, the other stability. One unleashes creativity, the other preserves consistency. Thus, it's disconcerting to hear some in the big data movement dismiss data warehousing as an inferior and antiquated form of data processing, best preserved in a computer museum rather than used for genuine business operations.

Every organization needs both Hadoop and data warehousing. These two environments need to work together synergistically. And it's not just that Hadoop should serve as a staging area for the data warehouse. That's today's architecture. Hadoop will grow beyond this to become a full-fledged reporting and analytical environment as well as a data processing hub. It will become a rich sandbox for savvy business analysts (whom we now call data scientists) to mine mountains of data for million-dollar insights and answer unanticipated or urgent questions that the data warehouse is not designed to handle.

Summary

Thomas Jefferson once said, "The tree of liberty must be refreshed from time to time with the blood of patriots and tyrants." He was referring to the natural process by which political and social structures stagnate and stratify. This principle holds true for many aspects of life, including data processing.

For too long, we've tried to shoehorn all analytical pursuits into a single top-down data delivery environment that we call a data warehouse. This framework is now creaking and groaning from the strain of carrying too much baggage. It's time we liberate the data warehouse to do what it does best, which is deliver consistent, non-volatile data to business users to answer predefined questions and populate key performance indicators within standard corporate and departmental dashboards and reports.

It's gratifying to see Hadoop come along and shake up data warehousing orthodoxy. The big data movement helps clarify the strengths and limitations of data warehousing and underscores its role within an analytics architecture. And this leaves Hadoop and NoSQL technologies to do what they do best, which is provide a cost-effective, agile development environment for processing and querying large volumes of unstructured data.

The organizations that figure out how to harmonize these environments will be the data champions of tomorrow.


Posted January 19, 2012 9:50 AM

(Editor's note: this is the first article in a multi-part series on big data.)

Most people define "big data" by three attributes: volume, velocity, and variety. These describe the main characteristics of big data, but aren't exclusive to it. Many data warehouses today exhibit these same characteristics. This article drills into each attribute and shows what data warehousing and big data environments do and don't have in common.

Volume. Data volume is a slippery term. Many observers have noted this. What's large for some organizations is small for others. So, experts now define the term as "data that is no longer easy to manage." This is still pretty squishy as far as definitions go.

From a historical context, data warehousing has always been about "big data." The real difference is scale and scope, which have been growing steadily for years. In the 1990s, high-end data warehouses contained hundreds of gigabytes and then terabytes. Today, they have breached the petabyte range, and surely will ascend to exabytes sometime in the future.

Does that make a data warehouse a big data initiative? Not really. The big data movement today is largely about using open source data management software to cost-effectively capture, store, and process semi-structured Web log data for a variety of tasks. (See "Let the Revolution Begin: Big Data Liberation Theology.") While data warehousing is focused solely on delivering structured data for reporting and analysis applications, the big data movement has broader implications. Hadoop and NoSQL can manage any type of data (structured, semi-structured, and unstructured) for virtually any type of application (analytical or transactional).

Velocity. If you have big data, by default you have to load it in real time using streaming or mini-batch load intervals. Otherwise, you can never keep up. This is nothing new. Most data warehousing teams have already converted from weekly and nightly batch refreshes to mini-batch cycles of 15 minutes or less that insert only deltas using change data capture and trickle feeding techniques. Hadoop and NoSQL databases are also evolving from batch loading to streaming data in real time.
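
For readers who haven't seen a delta load up close, here is a minimal sketch of the simplest form of it, a timestamp-based pull (log-based change data capture tools work differently, but the effect is the same: only changed rows move). The JDBC URL, table, and column names are placeholders I've invented for the illustration.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Timestamp;

    // Hypothetical mini-batch delta extract: each cycle pulls only the rows that
    // changed since the last high-water mark instead of reloading the whole table.
    public class DeltaExtractor {

        public static void main(String[] args) throws Exception {
            // In practice the watermark is persisted between runs; it is hard-coded here.
            Timestamp watermark = Timestamp.valueOf("2012-01-19 09:00:00");

            try (Connection conn = DriverManager.getConnection("jdbc:postgresql://source-host/orders_db");
                 PreparedStatement stmt = conn.prepareStatement(
                         "SELECT order_id, status, amount, last_modified " +
                         "FROM orders WHERE last_modified > ? ORDER BY last_modified")) {
                stmt.setTimestamp(1, watermark);
                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next()) {
                        // A real trickle feed would write this row to the warehouse's
                        // staging area; here we simply advance the watermark.
                        watermark = rs.getTimestamp("last_modified");
                    }
                }
            }
            System.out.println("New high-water mark: " + watermark);
        }
    }

Run every 15 minutes, a job like this keeps the warehouse within minutes of the source without ever repeating a full reload.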

Some organizations also embrace real-time loading to meet operational business needs. For example, 1-800 CONTACTS displays orders and revenues in a data warehouse-driven dashboard updated every 15 minutes. US Xpress tracks idle time of its trucks by capturing sensor data from truck engines fed into a data warehouse that drives several real-time dashboards. Currently, most big data installations don't support real-time reporting environments, but the technology is evolving fast and this capability will soon become standard fare.

Variety. Variety generally refers to the ability to capture, store, and process multiple types of data. This is perhaps the biggest differentiator between data warehousing and big data environments. Hadoop is agnostic about data type and format. Just dump your data into a file and then write a Java program to get it out. For example, a Hadoop cluster can store Twitter and Facebook data, audio and video, documents and transactions, and so on.

In addition, the same Hadoop file can contain a jumble of different records--or key-value pairs--each representing different entities or attributes. Although you can also do this in a columnar database, it's standard practice in Hadoop. (This is the "complexity" attribute that some industry observers add as a fourth attribute of big data.) Mixing record types puts the onus on the developer or analyst to sort through the records and find only the ones they want, which presumes foreknowledge about record types and identifiers. This would never fly in a data warehouse; the SQL simply wouldn't grok it.
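
To make that burden concrete, the short Java sketch below reads a raw file that mixes several record types and keeps only the ones tagged as orders. The file name, the tab-delimited layout, and the type tags are all assumptions for the example; the point is that the reader, not the storage layer, has to know them.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    // Hypothetical filter over a file that mixes "tweet", "click", and "order"
    // records, each tagged by a type label in the first tab-delimited field.
    public class MixedRecordFilter {

        public static void main(String[] args) throws IOException {
            try (BufferedReader reader = new BufferedReader(new FileReader("raw_events.tsv"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    if (!line.startsWith("order\t")) {
                        continue;              // foreknowledge of the tag is required
                    }
                    String[] fields = line.split("\t", -1);
                    if (fields.length < 3) {
                        continue;              // skip malformed records
                    }
                    System.out.println("order id=" + fields[1] + " amount=" + fields[2]);
                }
            }
        }
    }

Every consumer of the file has to carry this kind of knowledge in code, whereas a warehouse would have separated the entities into typed tables up front.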

Summary

The three Vs provide a reasonable map to the big data landscape as long as you don't dig too deeply into the details. There, you'll find considerable overlap with traditional data warehousing techniques. The real difference between the two environments is that big data is better suited to handling a variety of data (i.e., unstructured and complex data) than a data warehouse, which is designed to work with standardized, non-volatile data.


Posted January 19, 2012 9:43 AM

Most business intelligence (BI) methodologies feature a circular workflow, which might include the following steps: collect, integrate, report, analyze, decide, act. Unfortunately, these information technology (IT) centric workflows overlook the most important parts of the decision-making process: collaborate and review.

Collaborate

Most people don't make decisions in a vacuum; they share ideas, options, and perspectives with others. Nor do they analyze data in a vacuum, at least not the anomalies or variances that require further attention. When people exchange ideas on a topic, they refine each other's knowledge, fill in gaps, and challenge assumptions. The result is a more comprehensive understanding of a situation and a better course of action.

Most of the time, people collaborate with peers in a live, two-way exchange of information. Today, this sharing typically occurs by telephone and in face-to-face meetings, or asynchronously via email. But fanned by the rising popularity of social media sites, like Facebook and LinkedIn, business software vendors are looking to bring online collaboration features to business organizations.

For example, BI vendors, such as Panorama, Lyza Software, Actuate, Tibco Spotfire, and Yellowfin, now embed annotations, discussions, shared workspaces, and other collaboration features into their products. Other vendors sell general purpose collaboration platforms that serve as virtual water coolers and conference rooms where users can informally and formally share a wide range of information on almost any topic. Popular products here include Jive Software, which recently went public, SAP StreamWork, IBM Connections, and Microsoft SharePoint.

By all accounts, 2012 will be a breakout year for business collaboration software. (To enhance our knowledge of collaboration and BI, please take my current five-minute survey HERE.)

Review

But collaboration alone is not enough to guarantee excellent decision outcomes. To do that, people must review their decisions and analyze how they could have done things better. Otherwise, they are doomed to repeat their mistakes. Success comes not just from working hard, but from working smart. And that requires replaying past events and learning from them.

In the book, "How We Decide," author Jonah Lehrer tells the story of Bill Robertie, a world-class backgammon player (as well as chess and poker), who turned a childhood obsession into a lucrative career.

"Robertie didn't become a world champion just by playing a lot of backgammon. 'It's not the quantity of practice, it's the quality,' he says. According to Robertie, the most effective way to get better is to focus on your mistakes.... After Robertie plays a chess match, or a poker hand, or a backgammon game, he painstakingly reviews what happened. Every decision is critiqued and analyzed.... Even when he wins--and he almost always wins--he insists on searching for his errors, dissecting those decisions that could have been a little bit better. He knows that self-criticism is the secret to self-improvement, negative feedback is the best kind."

Interestingly, after years spent learning from their mistakes, experts like Robertie internalize this knowledge. This enables them to operate on a different plane of consciousness from non-experts. In the heat of action, their intuition takes over, and they simply "see" or "feel" what needs to be done. For example, Robertie said, "I knew I was getting good when I could just glance at a board and know what I should do. The game started to become very much a matter of aesthetics. My decisions increasingly depended on the look of things..."

Lehrer also describes how Tom Brady, the star quarterback for the New England Patriots football team, is able to make dozens of split-second decisions during a passing play. "Tom Brady spends hours watching game tape every week, critically looking at each of his passing decisions..." This weekly routine of self-criticism builds a literal body of knowledge that gives him an incredibly accurate "gut feel" when passing the ball during a game.

When asked to explain his abilities to make the right passing decisions, Brady says, "I don't know how I know where to pass. There are no firm rules. You just feel like you're going to the right place... And that's where I throw it."

Business teams, like individual experts, can build up a body of knowledge that enables them to make more accurate decisions, sometimes reflexively. But this can only happen if they assiduously study the impact of their decisions in a given area over a long period and strive to continuously improve.

Summary

To improve corporate decision making, individuals and teams not only need to collaborate, but they need to document and review each of their decisions. This will improve decision effectiveness and help build a true learning organization.

As BI professionals, we need to understand that our job is not done when we provide data to the business. We need to shepherd them along the entire analysis and decision making process. We need to embed collaboration into BI tools and link them to general purpose decision making platforms. In short, we not only need to be data experts, but decision experts as well.


Posted January 13, 2012 3:39 PM

In the quest to deliver business intelligence (BI) solutions, we often get wrapped up in the technology and lose sight of the end game, which is to help the business make better decisions.

Lately, I've been reading about decision making. One book, "Decide and Deliver: 5 Steps to Breakthrough Performance in Your Organization," is chock-full of practical advice about improving decision making in organizations. Written by a trio of Bain consultants, the book offers some great anecdotes and useful tools for improving your company's decision effectiveness.

One key point is that decision speed and execution are as important as decision quality. The authors quote Bill Graber, a long-time General Electric executive, who describes the source of GE's extraordinary performance during the 1980s and 1990s. "There is this myth that we made better decisions than our competitors. That's just not true. Our decisions probably weren't any better than many other companies. What GE did do was make decisions a lot faster than anybody else, and once we made a decision, we followed through to make sure we delivered the results we were expecting."

The authors say there are four elements to decision effectiveness:


  1. Quality. Good decisions are based on facts, not opinions, and take into account risk. There is also healthy debate among valid alternatives.

  2. Speed. Good decisions are made at the right speed. If made too slowly, the competition gains an advantage. If made too fast, valid alternatives are not explored and wrong decisions are made.

  3. Yield. Yield refers to an organization's ability to execute decisions. Decisions that don't trickle down to the people who need to carry out the actions don't succeed.

  4. Effort. Effort refers to the "time, trouble, expense, and sheer emotional energy" required to make or execute a decision. Too much effort slows decision making; too little effort results in poor-quality decisions.

The authors created a survey instrument to help organizations quantify their capabilities along these four dimensions. They also created a formula to determine an organization's overall decision effectiveness: Quality x Speed x Yield - Effort.
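
To see how the formula behaves, here is a toy calculation. The 0-to-1 scales and the sample scores are my own simplification for illustration, not the scoring used in the authors' survey instrument.

    // Toy illustration of decision effectiveness = quality x speed x yield - effort.
    public class DecisionEffectiveness {

        static double score(double quality, double speed, double yield, double effort) {
            return quality * speed * yield - effort;
        }

        public static void main(String[] args) {
            // A fast, well-executed decision of good quality, reached with modest effort...
            System.out.println(score(0.8, 0.9, 0.9, 0.1));   // roughly 0.55
            // ...beats a slightly higher-quality decision that was slow and costly to reach.
            System.out.println(score(0.9, 0.5, 0.9, 0.3));   // roughly 0.11
        }
    }

The multiplication is what gives the formula its teeth: a zero on quality, speed, or yield wipes out the entire score, no matter how strong the other dimensions are.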

As BI professionals, we need to strive to deliver the "right information to the right people in the right format" - a critical success factor touted in the book. Thus, it's imperative that we embrace agile development techniques and operationalize our data flows to ensure timely delivery of information to decision makers, so they can not only make the right decision, but do so more quickly than the competition.


Posted January 3, 2012 8:37 AM