

What is This Thing Called Big Data? Part 1 of Big Data: Giant Wave or Huge Precipice?

Originally published October 6, 2011

In this short series of articles, I explore the concept and reality of “big data.” What is it and where does it come from? Why is it important? How does it add value to the business? What is its impact on traditional data warehousing and business intelligence? In part 1, I explore the first two questions: what it is and where it comes from.

It’s difficult to avoid big data these days. More correctly, it’s difficult to avoid the phrase “big data.” It has become such an integral part of the sales pitches of so many vendors and the blog posts of so many experts that one might be forced to conclude that big data is all-pervasive. The truth is far more complex. Even a definition of big data is elusive.

I went in search of a definition at the fount of all modern knowledge, Wikipedia. There I found: “Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target currently ranging from a few dozen terabytes to many petabytes in a single data set.” As definitions go, this is pretty vague. Not only does the size span four-plus orders of magnitude, but it’s also a moving target, defined on the basis of “tolerable” (whatever that may mean) performance of “commonly used” (common to whom?) software tools. Small wonder that almost every vendor claims to be able to support it.

That’s not to say that data volumes aren’t growing, and at a staggering rate. According to International Data Corporation (IDC),1 the volume of data generated in the digital world in 2011 will be 1,800 exabytes (EB), or 1,800 million terabytes, and is set to grow almost 40% in the next year to 2,500 EB. By 2020, IDC predicts the number will have reached 35,000 EB, or 35 zettabytes (ZB), and it seems there won’t be enough disk space to store it all!

Such figures are beyond comprehension. Of course, much of this data consists of video, audio and image data generated by a general public waving smartphone cameras wherever they go. Perhaps we can argue sensibly that IT doesn’t need to worry so much, unless your business offers cloud data storage. But even in the relatively staid world of more traditional enterprise IT, the numbers are scary. In figure 1, we narrow the focus to enterprise data, showing traditional structured and unstructured2 data volumes. In 2005, enterprises stored 4 EB of structured data; by 2015 that will have grown to 29 EB, a compound annual growth rate (CAGR) of over 20%. The figures for unstructured business data confirm our worst fears: such data now far exceeds structured data in volume and is growing even faster. In 2005, it amounted to 22 EB; by 2015 it will reach 1,600 EB. That’s a staggering CAGR of over 50%, considerably faster than Moore’s Law.




Figure 1: Total Enterprise Data Growth, 2005-2015
© 9sight Consulting, 2011
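
As a quick sanity check on those growth rates, here is a minimal Python sketch of the compound annual growth rate calculation, using only the start and end volumes quoted above (the function and variable names are mine, purely for illustration):

    def cagr(start_volume, end_volume, years):
        """Compound annual growth rate between two data volumes over a number of years."""
        return (end_volume / start_volume) ** (1 / years) - 1

    # Structured enterprise data: 4 EB in 2005 growing to 29 EB in 2015
    print(f"Structured:   {cagr(4, 29, 10):.1%}")       # roughly 22% per year

    # Unstructured enterprise data: 22 EB in 2005 growing to 1,600 EB in 2015
    print(f"Unstructured: {cagr(22, 1_600, 10):.1%}")    # roughly 53% per year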

So, there is certainly a lot of data out there, and it’s growing fast. But, what is it and where is it coming from? One way of understanding it is by looking at its sources.

Categorizing Big Data by Source

Further research identified a number of major sources of the big data volumes that potentially concern IT managers in mainstream businesses. In the process, I finally tracked down the likely original source of the fear of big data – scientists! Modern scientific instrumentation has, for many years now, been capable of producing large quantities of digital data on an ongoing basis. To quote one current example, the Large Hadron Collider at CERN on the Swiss/French border is capable of generating 40 TB of data per second of operation. Perhaps we may be thankful that it took so long to get it fully operational! Astrophysics, genetics and meteorology, to name but a few areas of scientific research, are all producing data at large and ever-increasing rates.
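
To put a rate like 40 TB per second in perspective, a little back-of-the-envelope Python (my own arithmetic in decimal units, not figures published by CERN) shows how quickly such an instrument outstrips conventional storage:

    # Back-of-the-envelope arithmetic for a 40 TB/s instrument, in decimal units
    # (1 PB = 1,000 TB; 1 EB = 1,000,000 TB). Purely illustrative.
    rate_tb_per_second = 40

    per_hour_pb = rate_tb_per_second * 3_600 / 1_000        # terabytes -> petabytes
    per_day_eb = rate_tb_per_second * 86_400 / 1_000_000    # terabytes -> exabytes

    print(f"Per hour of operation: {per_hour_pb:,.0f} PB")  # 144 PB
    print(f"Per day of operation:  {per_day_eb:.2f} EB")    # about 3.5 EB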

I imagine that the percentage of readers here involved in supporting scientific research is rather small, so how does this affect the rest of us? Well, it’s a fact of the modern world that what is common in pure research doesn’t take too long to get incorporated into everyday engineering. Today, we are increasingly instrumenting our machines to enable them to measure and report continuously on their performance. For example, only a few years ago, ongoing telemetry of engine performance was confined to jet aircraft valued at millions of dollars. Now, automobile manufacturers are embedding monitors throughout their vehicles, providing continuous information on performance of all aspects of the mechanical systems of the vehicle. And, of course, once data is available, businesses will look for ways to profit from it. Such machine sensor data is the first category of big data we must consider.

Machine sensor data is not new, of course. ATMs and telephone switches, for example, have been providing such data for years. But, there are two key differences: (1) the much larger data volumes being generated and (2) the potential “rawness” of the data. In the old world, machine data was usually well pre-processed before IT (and users) had to deal with it. This new machine-sourced data is going to be much more “in your face”!

The second category of big data that is of interest is also machine generated, this time by computers. Computers are capable of recording substantial amounts of information about the events and conditions that characterize their current environments. In the past, such information has typically been of interest only to the operations and security staff of the organization. Today, it’s recognized that such data may contain interesting information about the actions and behavior of Internet and other users and thus provide potentially useful insights into their wants and needs. While potentially more varied than machine data, computer log data is still relatively well-structured and well-defined.
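
To illustrate just how regular such log data can be, the short sketch below parses a single web-server access line into named fields. The log layout and the sample line are my own assumptions for demonstration, not something taken from any particular system:

    import re

    # A typical web-server access log layout: client, timestamp, request, status, bytes.
    # The format and sample line are illustrative assumptions only.
    LOG_PATTERN = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+|-)'
    )

    line = '203.0.113.7 - - [06/Oct/2011:09:15:23 +0000] "GET /products/widget HTTP/1.1" 200 5324'

    match = LOG_PATTERN.match(line)
    if match:
        event = match.groupdict()
        # Each line yields a well-defined record: who requested what, when, and with what result.
        print(event["ip"], event["timestamp"], event["path"], event["status"])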

Moving to the next level, we see the data and information generated by users themselves. It makes sense to divide this level into two categories based on the ease with which the information can be analyzed and meaning extracted from it. So, the third category is textual information generated by people through emails, instant messaging, blogging, and so on. Such content combines structured, well-defined fields with more free-form text, where context and inference are vital to a complete understanding.
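
To make that distinction concrete, here is a small Python sketch that separates the well-defined fields of an email from its free-form body; the message itself is invented purely for illustration:

    from email import message_from_string

    # An email mixes structured header fields with free-form body text.
    # The message below is entirely fictional, for illustration only.
    raw = "\n".join([
        "From: alice@example.com",
        "To: support@example.com",
        "Subject: Order 1234 arrived damaged",
        "",
        "Hi, the widget I ordered last week arrived with a cracked casing.",
        "Can you send a replacement or a refund? Thanks, Alice.",
    ])

    msg = message_from_string(raw)
    print(msg["From"], "|", msg["Subject"])   # structured, well-defined header fields
    print(msg.get_payload())                  # free-form text where context and inference matter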

The fourth category, and by far the largest, is audio, image and video data. This data has the loosest structural characteristics, is the most voluminous and is the most difficult from which to extract meaningful conclusions and useful information.

These latter three categories are at the heart of the current excitement around big data. Large Internet-centric businesses such as Amazon, Google, eBay, Twitter and Facebook are using this information in enormous volumes to understand consumer behavior and predict specific needs and overall trends. The first category, machine sensor data, is probably generating less of a buzz, but is driving substantial changes in some business models. For example, automobile sensor data is being used to evaluate driver behavior and driving (if you’ll excuse the pun) substantial changes in the automobile insurance industry.

Conclusions

This simple categorization of big data into four classes provides some interesting insights. We can immediately see that big data is not a homogeneous data space. These different categories of data are very likely to require different approaches to management and processing. They are likely to be best handled with different technologies and tools.

It should also be clear that none of these categories is unknown to IT today; we already deal with all of them, albeit in smaller volumes, on a daily basis. Big data is not, in essence, something entirely new. The problem is, to a large extent, one of scale; hence the name. However, the insights we already have into these categories, and the different tools and approaches they require, must be carried forward into how we handle the same data categories at much larger scale.

We can also understand why the definition quoted earlier is necessarily vague. Big data is neither new nor specific. It is simply a pushing of the boundaries of data we already know in ways with which we are largely unfamiliar. As a result, depending on your point of view, big data appears either as a giant wave of business opportunity or a huge precipice of potential technological and management pain.

In the next article in this series, I’ll be looking at why big data has generated such interest and concern and what value it can offer to business in general.

End Notes:
  1. Based on IDC “Expanding Digital Universe” 2007-2011 sponsored by EMC, http://bit.ly/IDC_Digital_Universe. (Some figures are estimates and extrapolations from the published work and are believed to be within the correct order of magnitude.)

  2. The terms unstructured and structured are widely used to categorize data. As we shall see in this series of articles, these terms are both misleading and inadequate, but are used in this section in the broadly accepted meaning.

  • Barry Devlin
    Dr. Barry Devlin is among the foremost authorities in the world on business insight and data warehousing. He was responsible for the definition of IBM's data warehouse architecture in the mid '80s and authored the first paper on the topic in the IBM Systems Journal in 1988. He is a widely respected consultant and lecturer on this and related topics, and author of the comprehensive book Data Warehouse: From Architecture to Implementation.

    Barry's interest today covers the wider field of a fully integrated business, covering informational, operational and collaborative environments and, in particular, how to present the end user with an holistic experience of the business through IT. These aims, and a growing conviction that the original data warehouse architecture struggles to meet modern business needs for near real-time business intelligence (BI) and support for big data, drove Barry’s latest book, Business unIntelligence: Insight and Innovation Beyond Analytics, now available in print and eBook editions.

    Barry has worked in the IT industry for more than 30 years, mainly as a Distinguished Engineer for IBM in Dublin, Ireland. He is now founder and principal of 9sight Consulting, specializing in the human, organizational and IT implications and design of deep business insight solutions.

