In Greek mythology, Prometheus offended Zeus, the king of the gods, by giving fire to mankind without permission. To avenge this deed, Zeus sent the first woman, Pandora, along with her eponymous box, to Earth as a gift to Epimetheus, the brother of Prometheus. Fearing a trick, Prometheus begged Pandora not to open the box. However, curiosity overcame her. When she peeked inside, all the ills of the world flew out, leaving but one thing within – hope.
As in the myth, "big data" certainly poured out of the box in 2012. Whether it proves to be an evil or largely a source of hope and business benefits for organizations remains to be seen. The past year has shown both sides – from privacy issues and nonsensical statistics to improved medical treatments and modeling of climate change. The possibilities for using big data are proving endless, and the technology to enable them has evolved by leaps and bounds.
But there's more to the story. Prometheus means "foresight" in Greek, and Epimetheus means "after-thought." In today's business environment, big data has literally shifted our thinking on analytics from looking back to looking forward. As a result, we are beginning to see that big data raises real questions about how data is gathered, managed and used.
The year 2013 will certainly see big data technology continue to be developed and improved, along with the emergence of new business uses – and indeed, some dubious schemes – for it. But the IT focus must now shift to the architectural and governance issues that arise in big data environments. The phenomenon shows that what we have long processed in business is but a small proportion of the real – and almost limitless – world of information. IT and business intelligence (BI) managers must be prepared to handle much more than that.
I propose a new model for understanding big data in the context of all the structured data and less structured information created and captured by companies. The model (see Figure 1) includes three distinct domains:
- Human-Sourced Information. All information ultimately originates from people – it's the highly subjective record of human experiences, from text and images to audio and video, now almost entirely digitized and stored electronically. Loosely structured and often ungoverned, this information must be systematized and standardized for reliable use, by modeling and validating it in operational and BI systems to create the data that's in the second domain.
- Process-Mediated Data. Business processes record and monitor all business events, such as registering a customer and manufacturing a product. Process-mediated data is the highly structured and modeled data, as well as the contextual metadata, produced by these processes. Such data has long been the vast majority of what IT processed and managed in relational databases.
- Machine-Generated Data. Sensors and various machines record data on a wide array of events and situations that they monitor. Their output is machine-generated data; from simple sensor records to complex computer logs, it is well structured and very reliable. As sensors proliferate, the data they capture is becoming an important component of the information used for BI and analytics. The data's size and delivery speed are often beyond traditional approaches; in such cases, standalone high-performance relational and NoSQL databases are needed.
Figure 1: Unstructured forms of big data are most effective for analytics uses when combined with traditional structured data, according to the author's model.
In essence, the model I propose shows that emerging big data sources, often poorly governed or managed themselves, need to be enhanced with traditional process-mediated data to deliver useful and relevant business analytics. As a result, the market focus is likely to shift substantially from big data startups and small vendors to more established vendors with enterprise-scale technologies for semantic and physical integration of multiple data types from various sources – a trend we've seen emerging.
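To make that combination concrete, here is a minimal Python sketch of enriching big data records with process-mediated context. It is only an illustration under stated assumptions: the `Record` type, the shared `key` field (standing in for something like a customer ID) and the `enrich` function are hypothetical names, not part of the model or of any product.

```python
from dataclasses import dataclass
from enum import Enum


class Domain(Enum):
    """The three domains named in the model."""
    HUMAN_SOURCED = "human-sourced information"
    PROCESS_MEDIATED = "process-mediated data"
    MACHINE_GENERATED = "machine-generated data"


@dataclass(frozen=True)
class Record:
    domain: Domain
    key: str       # hypothetical shared identifier, e.g. a customer ID
    payload: dict  # whatever the source captured


def enrich(big_data, reference):
    """Attach process-mediated context to each record that shares a key.

    Records with no matching process-mediated entry are dropped here;
    a real system might instead route them to a review queue.
    """
    context = {
        r.key: r.payload
        for r in reference
        if r.domain is Domain.PROCESS_MEDIATED
    }
    enriched = []
    for r in big_data:
        if r.key in context:
            enriched.append(
                {**r.payload, **context[r.key], "source": r.domain.value}
            )
    return enriched


# Illustrative sample data (invented for this sketch).
customers = [
    Record(Domain.PROCESS_MEDIATED, "C1",
           {"name": "Acme Ltd", "segment": "retail"}),
]
events = [
    Record(Domain.MACHINE_GENERATED, "C1", {"sensor_temp_c": 71.5}),
    Record(Domain.HUMAN_SOURCED, "C1", {"comment": "unit runs hot"}),
    Record(Domain.MACHINE_GENERATED, "C9", {"sensor_temp_c": 20.0}),  # no match
]

result = enrich(events, customers)
```

In this sketch the sensor reading and the free-text comment for customer `C1` both pick up the governed customer attributes, while the reading for the unknown key `C9` is left out because there is no process-mediated record to give it business meaning.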
The second emergent trend in 2012 that is likely to accelerate in 2013 is an increased emphasis on business value. So far, we've seen much interest in analysis of social media information for brand awareness, emerging product issues and more, as well as big data analytics in support of operational excellence. The focus is set to shift, however, toward process innovation – the use of previously unavailable data from a variety of sources to invent new ways of doing old business.
Finally, flying under the radar now is the next wave of really big data technologies from Web denizens such as Google and Facebook, whose needs have exceeded the capabilities of file-based tools like Hadoop. A new wave of tools – Dremel, Caffeine, Pregel, Spanner and Prism – may be upon us as the biggest big data proponents inexorably move the needle from a batch-oriented, eventually consistent paradigm toward a distributed but ACID-compliant database mind-set.
Recent articles by Barry Devlin