
Blog: Wayne Eckerson

Wayne Eckerson

Welcome to Wayne's World, my blog that illuminates the latest thinking about how to deliver insights from business data and celebrates out-of-the-box thinkers and doers in the business intelligence (BI), performance management and data warehousing (DW) fields. Tune in here if you want to keep abreast of the latest trends, techniques, and technologies in this dynamic industry.

About the author

Wayne has been a thought leader in the business intelligence field since the early 1990s. He has conducted numerous research studies and is a noted speaker, blogger, and consultant. He is the author of two widely read books: Performance Dashboards: Measuring, Monitoring, and Managing Your Business (2005, 2010) and The Secrets of Analytical Leaders: Insights from Information Insiders (2012).

Wayne is founder and principal consultant at Eckerson Group, a research and consulting company focused on business intelligence, analytics, and big data.

February 2013 Archives

After attending several big data conferences, I had to ask myself, "What's really new here?" After all, as a data warehousing practitioner, I've been doing "big data" for some 20 years. Sure, the scale and scope of the solutions have expanded along with the types of data that are processed. But much of what people are discussing seems a rehash of what we've already figured out.

After some deliberation, I came to the conclusion that there are six unique things about the current generation of "big data," which has become synonymous with Hadoop. Here they are:

  1. Unstructured data. Truth be told, the data warehousing community never had a good solution for processing unstructured and semi-structured data. Sure, we had workarounds, like storing this data as binary large objects or pointing to data in file systems. But we couldn't really query this data with SQL and combine it with our other data (although Oracle and IBM have pretty good extenders to do just this). But now with Hadoop, we have a low-cost solution for storing and processing large volumes of unstructured and semi-structured data. Hadoop has quickly become an industry "standard" for dealing with this type of data. Now we just have to standardize the interfaces for integrating unstructured data in Hadoop with structured data in data warehouses.
  2. HDFS. The novel element of Hadoop (at least to SQL proponents) is that it's not based on a relational database. Rather, under the covers, Hadoop is a distributed file system into which you can dump any data without having to structure or model it first. The Hadoop Distributed File System (HDFS) runs on low-cost commodity servers, which it assumes will fail regularly. To ensure reliability in a highly unreliable environment, HDFS automatically transfers processing to an alternate server if one server fails. To do this, it requires that each block of data be replicated three times and placed on different servers, racks, and/or data centers. So with HDFS, your big data is three times bigger than your raw data. But this data expansion helps ensure high availability in a low-cost processing environment based on commodity servers.
  3. Schema at Read. Because Hadoop runs on a file system, you don't have to model and structure the data before loading it like you would do with a relational database. Consequently, the cost of loading data into Hadoop is much lower than the cost of loading data into a relational database. However, if you don't structure the data up front during load time, you have to structure it at query time. This is what "schema at read" means: whoever queries the data has to know the structure of the data to write a coherent query. In practice, this means that only the people who load the data know how to query it. This will change once Hadoop gets some reasonable metadata, but right now, issuing queries is a buyer-beware environment.
  4. MapReduce. Hadoop is a parallel processing environment, like most high-end, SQL-based analytical platforms. Hadoop spreads data across all its nodes, each of which has direct-attached storage. But writing parallel applications is complex. MapReduce is an API that shields developers from having to know the intricacies of writing parallel applications on a distributed file system. It takes care of all the underlying inter-nodal communication, error checking, and so on. All developers need to know is which elements of their application can be parallelized.
  5. Open source. Hadoop is free; you can download it from the Apache Foundation and start building with it. For a big data platform, this is a radical proposition, especially since most commercial big data software easily carries a six- to seven-figure price tag. Google developed the precursor to Hadoop as a cost-effective way to build its Web search indexes and then made its intellectual property public for others to benefit from its innovations. Google could have used relational databases to build its search indexes, but the costs of doing so would have been astronomical, and it would not have been the most elegant way to process Web data, which is not classically structured.
  6. Data scientist. You need data scientists to extract value from Hadoop. From what I can tell, data scientists combine the skills of a business analyst, a statistician, a business domain expert, and a Java coder. In other words, they really don't exist. And if you can find one, they are expensive to hire. But the days of the data scientist are numbered; soon, the Hadoop community will deliver higher level languages and interfaces that make it easier for mere mortals to query the environment. Meanwhile, SQL-based vendors are working feverishly to integrate their products with Hadoop so that users can query Hadoop using familiar SQL-based tools without having to know how to access or manipulate Hadoop data.
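To make the MapReduce idea concrete, here is a minimal sketch of the programming model in plain Python. This is not the Hadoop Java API; real jobs implement Mapper and Reducer classes and the framework distributes the work across nodes. The sketch only shows the two functions a developer actually writes, with the shuffle step the framework normally handles done in-process:

```python
from collections import defaultdict

def map_phase(line):
    # Developer-written map logic: emit a (key, value) pair per word.
    for word in line.lower().split():
        yield (word, 1)

def reduce_phase(word, counts):
    # Developer-written reduce logic: combine all values for one key.
    return (word, sum(counts))

def run_job(lines):
    # In Hadoop, this shuffle/sort happens across the cluster;
    # here it is simulated with a dictionary.
    shuffled = defaultdict(list)
    for line in lines:
        for word, count in map_phase(line):
            shuffled[word].append(count)
    return dict(reduce_phase(w, c) for w, c in shuffled.items())

print(run_job(["big data is big", "data is data"]))
# → {'big': 2, 'data': 3, 'is': 2}
```

The framework's value is everything outside these two functions: partitioning the input, moving intermediate pairs between nodes, and retrying failed tasks.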

So, those are the six unique things that Hadoop brings to the market. Probably the most significant is that Hadoop dramatically lowers the cost of loading data into an analytical environment. As such, organizations can now load all their data into Hadoop without financial or technical penalty. The mantra shifts from "load only what you need" to "load it in case you need it." This makes Hadoop a much more flexible and agile environment, at least on the data loading side of the equation.
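The "load first, structure later" pattern works because the schema lives in the query, not in the storage layer. A minimal Python sketch of schema at read follows; the records, delimiter, and field layout are invented for illustration:

```python
# Raw records are dumped as opaque strings, exactly as they arrived.
# Hypothetical layout: date|channel|event|count
raw_records = [
    "2013-02-19|web|click|42",
    "2013-02-19|mobile|view|7",
    "2013-02-20|web|click|13",
]

def parse(record):
    # The schema is applied here, at read time. Whoever writes this
    # query must already know the field layout -- nothing in the
    # storage layer documents it.
    date, channel, event, count = record.split("|")
    return {"date": date, "channel": channel,
            "event": event, "count": int(count)}

# "Query": total clicks from the web channel.
clicks = sum(r["count"] for r in map(parse, raw_records)
             if r["channel"] == "web" and r["event"] == "click")
print(clicks)
# → 55
```

Loading cost nothing because no modeling was done up front, but the price is paid at query time: every consumer of the data must rediscover or be told the layout, which is exactly the metadata gap noted above.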

Posted February 19, 2013 11:59 AM

When I went to my first big data conference almost three years ago, I thought I had been transported to a parallel universe: everyone was talking about data and analytics, yet data warehousing, SQL, and relational databases were dirty words.

Then, I looked at how people were dressed. I was the only person in the hall with a sports jacket, collared shirt, and leather shoes. Everyone else was wearing jeans, t-shirts, and sneakers and sported ponytails. Then, it dawned on me: these were Java developers who had outgrown MySQL and were looking for a more scalable open source platform to run data-intensive, Web-based applications. And Hadoop was the answer to their big data dreams.

Immersing yourself in a foreign culture often crystallizes who you are and where you come from. For the first time in my professional life, I realized that I was a data guy from corporate IT who was wedded to commercial software and SQL-based processing. Standing brazenly in my blue blazer amidst a sea of Java coders, I also realized who I wasn't: an application developer who valued open source software.

Yet, my presence at this early big data event symbolized the beginning of the convergence of these two distinct communities: "data people" and "applications people" have worked side by side for many years but rarely intermingled or aligned approaches. Fast forward two years. The big data conference I attended this fall had just as many "suits" as ponytails in the audience. The convergence is proceeding apace, as both communities recognize the opportunities of joining forces as well as the risks of remaining isolated.

Opportunities and Threats

Opportunities. For SQL-based vendors, the world of Hadoop and NoSQL opens new lucrative markets consisting of customers that want to harness large volumes of unstructured and semi-structured data for business gain. For Hadoop vendors, SQL-based products represent hundreds of potential applications that can legitimize the Hadoop platform once they interface with or are ported to Hadoop.

Threats. At the same time, Hadoop and NoSQL products represent a huge threat to traditional SQL-based vendors. Hadoop is like a Swiss Army knife that can be used to do almost anything. Consequently, many advocates believe Hadoop spells the death knell of SQL-based databases and data warehouses. And they might have a point, since many data warehousing managers are just starting to question why they would want to move data out of Hadoop to do query, reporting, and data mining.

Conversely, SQL-based vendors, which collectively represent hundreds of billions of dollars in annual sales, aren't likely to cede this new market to a handful of open source upstarts. They are already circling the wagons, coopting Hadoop and NoSQL software by embedding them into their commercial products. This surround-and-drown strategy could spell the doom of independent, open source Hadoop vendors.

The only remaining question is which community wins in the end. My bet is on the commercial SQL vendors, which are much larger, more established, and offer robust, enterprise-caliber products that today's organizations rely on to run their businesses. They may have to radically transform their architectures and product suites to co-opt upstart Hadoop and NoSQL approaches, but they'll do what they need to stay on top and in control.

Posted February 13, 2013 1:58 PM

Have you ever seen anything more hyped in the history of information management than big data? I haven't. Ok, artificial intelligence probably incited a similar media storm, but that was before my time.

What's in a number? The ironic thing is that data by itself has no intrinsic value. For example, if I gave you three numbers--100,000, 300,000, and 500,000--would you say they provide any value to you or your organization? Of course not. What if I told you those numbers referred to US currency? That's context, but no value. What if I said those figures referred to your manufacturing organization's net profits for the past three quarters? Now, that's interesting and certainly good news; but there is still no business value.

But what if I said that your profit growth is due to home builders in the Midwest who are bundling your company's biggest electrical generators into their building packages in response to severe storms caused by climate change? Now, that's data--or insight--that you can take to the bank. For instance, armed with this knowledge, your organization might manufacture more high-end generators and fewer lower-end ones to accommodate the new demand. Or even better, you might identify the builders in the Midwest who haven't yet bundled your high-end generators with their home products and give them a 10% coupon to follow the lead of their peers.

Insights and actions. The point is that data--even big data--is useless without analysis and insight. Therefore, instead of talking about "big data", we should be talking about "big data analytics." Joining analytics and data can deliver real business value.

But there is a caveat: analysis without action produces no value. It's one thing to know what drives profit growth in our example above; it's another to do something about it. Insights without actions don't get you very far. So, instead of talking about "big data analytics" we should really be talking about "big data actionable analytics."

Impacts. At the risk of sounding didactic, even actionable analytics doesn't guarantee business value. That's because actions that don't move the organization in a positive direction are useless. That's like a salesman saying he should receive a commission for saying and doing all the right things with a client even though he didn't win the deal.

So, the most technically accurate term for this new phenomenon that is taking our industry by storm is: "Big data insights that drive actions that help an organization achieve its goals." Of course, that is too wordy and would never fly as an industry buzzword. But you get the point: data without analysis, analysis without action, and action without positive impact deliver no value.

So when you hear the hype surrounding big data, remember that data by itself has no value; it's what you do with it that counts.

Posted February 1, 2013 2:19 PM