
Big Data – Separating the Hype from Reality

Originally published March 5, 2012

The air’s thin at the top of the hype curve, so breathe deeply as we explore the reality of “big data.”

Big data is everywhere these days. Marketing materials bristle with references to how products have been enhanced to handle big data. Consultants and analysts are busy writing new articles (as I am too!) and creating elegant presentations. But the sad reality is that big data remains one of the most ill-defined terms we’ve seen in many a year. Take a look at the Wikipedia definition, and you’ll see what I mean. The problem is that data volume (i.e., big) is a metric that tells us very little about the data characteristics that allow us to understand its sources, its uses in business and how we need to handle it in practice. Even the emerging approach of describing big data in terms of volume, velocity and variety leaves a lot to be desired in terms of clarity about what big data really is.

Business Drivers and Origins

So, what is the problem? And, more to the point, is there an answer? The problem is that big data in a technical sense, beyond the common characteristic of “bigness,” has very little else in common. Hence, the difficulty in coming up with a single, all-encompassing definition.

However, in a business sense, there is one common theme – predicting the future! Based on statistical analysis of past and present reality, we try to predict and/or influence future events, behaviors and so on. This is the same goal that we’ve seen in data mining since the 1990s. In simple terms, the business driver for big data is a logical extension of data mining. The novelty lies in the fact that with ever larger data volumes and new data sources, we can obtain more statistically accurate results and, hopefully, make more accurate predictions.

Thus we return to data volumes. The origins of the term big data can be traced back to the scientific community. Astronomy, physics, biology and more have long been at the forefront of collecting vast quantities of data from ever more sophisticated sensors. By the early 2000s, they encountered significant problems in processing and storing these volumes and coined the term big data – probably as a synonym for big headaches! Thus, we see here the beginnings of the business driver mentioned above, as science today is founded largely on statistical analysis of collected data. What begins in pure science moves inexorably to engineering and finally emerges in business and, especially, marketing.

Definition and Handling

It is that evolution in usage that leads to the conclusion that no single definition of big data is possible – it’s a phrase that takes meaning from the context of its use. Do not despair, however! This thinking also leads directly to a more useful understanding of four different classes of big data, each with well-defined characteristics and uses, as shown in the figure below, laid out according to their sourcing and structuring.

Figure 1: The Four Classes of Big Data

The first class is metrics and measures, emanating more or less directly from sensors, monitoring devices and less complex machines, including RFID readers; ZigBee devices; the multitude of sensors in modern airplanes, cars and even cameras; and, perhaps most interestingly, smartphones. Such data is highly structured and reflects discrete events or characteristics of the physical world. The second class, also machine-sourced, consists of computer event logs, tracking everything from processor usage and database transactions to clickstreams and instant message distribution. While machine-generated, data in both of these classes is a proxy for events in the real world and, in business terms, the data that records the results of human actions is of particular interest. For example, measurements of speed, acceleration and braking forces from an automobile can be used to make inferences about driver behavior and thus insurance risk.
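The insurance example above can be sketched in a few lines. This is a purely illustrative toy, assuming hypothetical events, thresholds and a made-up scoring rule – real telematics models are far more sophisticated:

```python
# Minimal sketch of inferring a driver-risk indicator from class-one
# sensor measurements. All thresholds and weights are hypothetical.
from dataclasses import dataclass

@dataclass
class DriveEvent:
    speed_kmh: float   # instantaneous speed
    accel_ms2: float   # longitudinal acceleration (negative = braking)

def risk_score(events: list[DriveEvent]) -> float:
    """Fraction of samples showing 'harsh' driving (illustrative proxy)."""
    if not events:
        return 0.0
    harsh = sum(
        1 for e in events
        if e.speed_kmh > 130 or e.accel_ms2 > 3.0 or e.accel_ms2 < -4.0
    )
    return harsh / len(events)

trip = [DriveEvent(60, 0.5), DriveEvent(140, 1.0), DriveEvent(80, -5.0)]
print(round(risk_score(trip), 2))  # 2 of 3 samples are harsh -> 0.67
```

The point is not the particular rule but the shift: highly structured machine-sourced measurements become a statistical proxy for human behavior.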

In the top half of the diagram, in classes three and four, we have social media information directly created by humans, divided into the more highly structured textual information and the less structured multimedia audio, image and video categories. Statistical analysis of such information gives direct access to people’s opinions and reactions, allowing new methods of individual marketing and direct response to emerging opportunities or problems. Much of the current hype around big data comes from the insights into customer behavior that Web giants like Google and eBay and mega-retailers such as Walmart can obtain by analyzing data in these classes (especially the textual class, so far). However, in the longer term, machine-generated data, particularly class one, is likely to be the big game-changer, simply because of the number of events recorded and communicated.
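To make the textual class concrete, here is a deliberately tiny lexicon-based sentiment count of the kind that underpins the statistical analysis of social media mentioned above. The word lists and posts are invented for illustration; production systems use far richer language models:

```python
# Toy lexicon-based sentiment scoring of short social media posts.
# The lexicons below are hypothetical and far too small for real use.
POSITIVE = {"great", "love", "fast", "reliable"}
NEGATIVE = {"slow", "broken", "hate", "refund"}

def sentiment(post: str) -> int:
    """Positive minus negative word hits; the sign indicates overall tone."""
    words = post.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

posts = ["Love the new phone, so fast", "Screen broken again, want a refund"]
print([sentiment(p) for p in posts])  # [2, -2]
```

Aggregated over millions of posts, even a crude signal like this gives the direct access to opinions and reactions that the article describes.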

But What About My Current BI System?

From a business viewpoint, big data significantly shifts the emphasis in business intelligence (BI) from reporting and problem-solving to prediction. The former won’t go away, of course, and high levels of competence and investment in those aspects will continue to be needed – just to stay in the game. However, the ability to anticipate changes in the market provided by advanced analytics on large data volumes will separate the leaders from the also-rans.

From an IT point of view, the issue divides largely between the top and bottom halves of the diagram shown. In the bottom half, we deal with data that is structurally similar to that on which traditional business intelligence is based. At the high end, volumes and velocity will continue to demand innovative technological solutions. Lower down the scale, traditional tools and techniques will likely stretch upwards to larger slices of the middle ground. However, one thing is clear: The old thinking that all data must be funneled through an enterprise data warehouse cannot survive.

This becomes even clearer when we look at the top half of the picture. The data found there has very different characteristics than traditional BI data. Not only does it have far less structure, but also that structure is fluid and its semantics largely unfavorable for the type of prior modeling that is the foundation of traditional data warehousing. This socially sourced data will most likely continue to require a very different environment and approach to analysis and management. However, it will need to be linked to classic business intelligence via summary result data imported into the warehouse environment and metadata that bridges the semantic gap between the two areas.
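The linkage via summary result data can be sketched as follows. This is an assumption-laden illustration: the raw result tuples, the fact-table shape and all names are invented, standing in for whatever the social analysis environment actually produces:

```python
# Sketch of the bridge described above: raw socially sourced analysis
# results are reduced to summary rows suitable for import into a
# warehouse fact table. All data and column shapes are hypothetical.
from collections import defaultdict

raw_results = [  # (date, product, sentiment score) from the social platform
    ("2012-03-01", "widget", 2),
    ("2012-03-01", "widget", -1),
    ("2012-03-02", "widget", 1),
]

summary = defaultdict(lambda: {"n": 0, "total": 0})
for date, product, score in raw_results:
    summary[(date, product)]["n"] += 1
    summary[(date, product)]["total"] += score

# Rows shaped for a warehouse fact table: (date, product, mentions, avg score)
warehouse_rows = [
    (d, p, v["n"], v["total"] / v["n"]) for (d, p), v in summary.items()
]
print(warehouse_rows)
```

Only the compact summaries cross into the warehouse; the fluid, loosely structured source data stays in its own environment, with metadata carrying the semantics across.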

The reality is that big data is going to provide business intelligence with a significant growth stretch and that the technology is evolving and merging rapidly to meet this challenge. Check out my big data articles on BeyeNETWORK for deeper insights.

  • Barry Devlin
    Dr. Barry Devlin is among the foremost authorities in the world on business insight and data warehousing. He was responsible for the definition of IBM's data warehouse architecture in the mid '80s and authored the first paper on the topic in the IBM Systems Journal in 1988. He is a widely respected consultant and lecturer on this and related topics, and author of the comprehensive book Data Warehouse: From Architecture to Implementation.

    Barry's interest today covers the wider field of a fully integrated business, covering informational, operational and collaborative environments and, in particular, how to present the end user with an holistic experience of the business through IT. These aims, and a growing conviction that the original data warehouse architecture struggles to meet modern business needs for near real-time business intelligence (BI) and support for big data, drove Barry’s latest book, Business unIntelligence: Insight and Innovation Beyond Analytics, now available in print and eBook editions.

    Barry has worked in the IT industry for more than 30 years, mainly as a Distinguished Engineer for IBM in Dublin, Ireland. He is now founder and principal of 9sight Consulting, specializing in the human, organizational and IT implications and design of deep business insight solutions.



Posted March 6, 2012 by Barry Devlin

Juha, absolutely agree with your comment that much more rapid modelling and acceptance of shorter solution lifecycles are required for these data types. In fact, they need to be modelled "in flight" rather than in advance, as is the traditional approach.


Posted March 6, 2012 by Juha Teljo

Very good article! With the current confusion, I am starting to lean toward a definition where any data that has characteristics (like volume or speed) that require new methods to manage it falls into this big data bucket. Linking this to BI – I think there will be a requirement for much more rapid modelling and acceptance of shorter solution lifecycles than in traditional transaction-warehouse-BI solutions, where once-modelled processes can be intact for years. The requirements and solutions I see at the top of the diagram are often much more one-off, or at least significant for a shorter period of time.


