In this series of articles, I explore the concept and reality of “big data.” What is it and where does it come from? Why is it important? How does it add value to the business? What is its impact on traditional data warehousing and business intelligence? In part 1, I explored the first two questions: what it is and where it comes from. In part 2, I examined the next two questions: Why is it important and how does it add business value? In this article, I answer an extra question on tools that will be important as we look at the impact on traditional data warehousing and business intelligence.
What’s All the Hadoop-la About?
It’s well nigh impossible to discuss big data without mentioning Hadoop. Yes, there are other tools and techniques around, but first, let’s deal with the elephant in the room. In fact, the elephant is far beyond the room and is trampling the entire data management landscape…
So, what is Hadoop? If you already know, bear with me a moment. It’s often called an ecosystem for storing and processing large volumes of data in a distributed, commodity hardware environment. And given the plethora of oddly named, independently developed – but related – components, it certainly is an ecosystem. But, in a very simplistic view, the Hadoop ecosystem consists of a diverse set of fairly basic utilities to support programmers who need to write distributed applications. These utilities began with the Hadoop Distributed File System (HDFS) and a framework and controllers (MapReduce) to distribute application code (Map) to multiple servers and to collect and recombine the results (Reduce).
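The Map/Reduce split can be made concrete with the canonical word-count example. What follows is a minimal, single-process sketch of the pattern in Python, purely for illustration; real Hadoop distributes the map and reduce phases, and the shuffle between them, across many servers.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (key, value) pair for each word seen.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does
    # before handing them to the reducers.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Reduce: combine the values for one key into a single result.
    return (key, sum(values))

documents = ["big data is big", "data about data"]
pairs = [p for doc in documents for p in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
# counts == {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```

The appeal of the model is that map and reduce are independent per key, so the framework can run them on thousands of machines without the programmer writing any coordination code.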
For those of us involved in data management, databases and data warehousing for many years, a couple of key considerations immediately come to mind. Yes, given enough clever programmers and extensive arrays of cheap hardware on such a system, you can run serious analytic and data-munging applications that were previously too slow, too large, or both to be viable. And, given the hyper-exponential growth of data from the Internet and the devices attached to it, there are real needs to be met and significant potential benefits to be gained. But no, this is not an environment where data management issues like governance, quality, consistency and integrity are easily handled. For that, you need databases, metadata stores and the like. Accordingly, the Hadoop community has been adding a herd of smaller elephants such as HBase to support random read/write, Hive as a SQL-like language, the Avro metadata schema, Oozie for workflow processing and Sqoop to manage populating HDFS files with data.
Don’t get me wrong – there are enormous business incentives to jump on the Hadoop howdah,1 as discussed in part 2 of this series. But, there’s also an enormous hype machine that seems to suggest that some magic is happening with Hadoop. There isn’t. It’s nothing more than a very successful programming environment for writing applications to process large quantities of data in parallel. Its strengths include scalability, low cost and, perhaps most importantly, flexibility to handle the wide range of ever-changing data structures characteristic of the emerging big data environment. Its weaknesses include a rapidly evolving and incoherent tool set, a highly programmatic and largely batch environment and a severe lack of attention to data management issues. This last point has been the downfall of many business intelligence initiatives. I see no reason why Hadoop should be an exception; in fact, given the data volumes and velocities involved, I predict even bigger issues.
SQL or NoSQL – That is the Question
In parallel with the Hadoop phenomenon (pun intended), the database world has seen a number of approaches evolve to deal with large data volumes. While it’s fair to say that most of these solutions do not stretch to the same extremes as Hadoop, it’s also the case that many big data problems aren’t extreme at all. Not all companies are Googles, eBays or Facebooks. What’s big data for many businesses doesn’t amount to a hill of beans for the biggest data consumers. Furthermore, not all big(ish) data is as ill-defined and changeable as that encountered by businesses pushing the edges of Internet business models. In such cases, a database-based solution may be the answer, especially if consistency and reliability are strong requirements.
Massively parallel processing (MPP) relational databases have been available for many years from vendors such as Teradata and IBM, distributing SQL workloads over multiple processors in a well-managed, consistent and reliable environment. More recently, startups like Netezza (acquired in 2010 by IBM) and Greenplum (acquired by EMC in the same year) extended the model to create appliances built on commodity hardware. Such machines deliver performance improvements whether measured by query speed, data volume or processing cost. Columnar databases from the likes of HP Vertica and ParAccel push analytic performance further. And, while in-memory databases like SAP HANA will struggle to compete on big data for some time, they raise query performance even higher. The bottom line is that relational databases still have a significant role in the big data environment.
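The columnar idea is easy to see in miniature. The Python sketch below (an illustration only, not how any of these products is implemented) contrasts row-wise and column-wise layouts of the same data: an analytic aggregate over one column need only read that column’s values, which is where columnar databases gain much of their speed.

```python
# The same small table, first row-wise, then column-wise.
rows = [(1, "east", 250.0), (2, "west", 310.0), (3, "east", 180.0)]

columns = {
    "id":     [r[0] for r in rows],
    "region": [r[1] for r in rows],
    "amount": [r[2] for r in rows],
}

# A query like SUM(amount) touches one contiguous column,
# not every field of every row.
total = sum(columns["amount"])
# total == 740.0
```

On disk, reading one column instead of whole rows also compresses far better, since a column holds values of a single type.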
We also hear much about NoSQL databases such as Apache Cassandra, CouchDB, MongoDB and many more in the big data environment. As the name implies, such databases are not relational, neither in their data models nor in their access languages. But they are databases, and they support, to varying degrees, the data management characteristics associated with traditional databases, relational or otherwise. In some cases, their structures are “looser” than the relational model, allowing greater flexibility in handling changes in data structure or content as applications evolve. In other cases, the structure is optimized for a specific application type, such as graph analysis. In general, we can say that NoSQL databases span the gap between the highly structured and well-managed world of relational databases and the multi-structured, programmatic world of Hadoop.
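The “looser” structure is easiest to see with a document-store example. The Python sketch below uses plain dictionaries to mimic the kind of schema-free records a store like MongoDB holds; the field names are invented for illustration. No schema change is needed when a new field appears, but every query must then tolerate records that lack it:

```python
# Records in one "collection" need not share a fixed schema,
# so fields can be added as the application evolves.
customers = [
    {"id": 1, "name": "Acme Corp"},
    {"id": 2, "name": "Widget Ltd", "twitter": "@widget"},  # new field, no schema change
    {"id": 3, "name": "Gadget Inc", "orders": [101, 102]},  # nested structure
]

# Unlike a relational table, where every row has every column,
# queries here must allow for missing fields.
with_twitter = [c["name"] for c in customers if "twitter" in c]
# with_twitter == ['Widget Ltd']
```

That trade-off is the crux: the flexibility that makes evolving applications easy also shifts responsibility for consistency from the database to the application code.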
Big data is high on the hype curve. The big data tools market is in the very early stages of development. It’s also one that’s in a constant and urgent state of change. Announcements of new tools and new versions of tools are an almost weekly occurrence. Every database and data management vendor wants a piece of the action; there’s hardly an existing tool that hasn’t been painted with go-bigger stripes. In all of this confusion, there is a strong temptation for IT to hit the pause button and wait for clarity. Except this business is not for pausing…
My advice is not to pause, but to move deliberately and with care. What exactly is the business need you’re trying to address? Do you actually have access (or the possibility thereof) to the big data you believe conceals the information you want? What analysis are you attempting and what combination of tools is needed to support it? Have they been tested to work together in a single distribution? Do you have the employees with analysis and programming skills – called “data scientists” these days – who can do the actual work? And last, but not least, how will you move any insights gained into production and operationalize the governance and data management aspects of the solution?
1. A carriage mounted on the back of an elephant.