After attending several big data conferences, I had to ask myself, "What's really new here?" After all, as a data warehousing practitioner, I've been doing "big data" for some 20 years. Sure, the scale and scope of the solutions have expanded, along with the types of data being processed. But much of what people are discussing seems like a rehash of what we've already figured out.
After some deliberation, I came to the conclusion that there are six unique things about the current generation of "big data," which has become synonymous with Hadoop. Here they are:
- Unstructured data. Truth be told, the data warehousing community never had a good solution for processing unstructured and semi-structured data. Sure, we had workarounds, like storing this data as binary large objects or pointing to data in file systems. But we couldn't really query this data with SQL or combine it with our other data (although Oracle and IBM have pretty good extenders that do just this). But now with Hadoop, we have a low-cost solution for storing and processing large volumes of unstructured and semi-structured data. Hadoop has quickly become an industry "standard" for dealing with this type of data. Now we just have to standardize the interfaces for integrating unstructured data in Hadoop with structured data in data warehouses.
- HDFS. The novel element of Hadoop (at least to SQL proponents) is that it's not based on a relational database. Rather, under the covers, Hadoop is a distributed file system into which you can dump any data without having to structure or model it first. The Hadoop Distributed File System, or HDFS, runs on low-cost commodity servers, which it assumes will fail regularly. To ensure reliability in a highly unreliable environment, HDFS automatically shifts processing to an alternate server when one fails. To do this, it requires that each block of data be replicated three times and placed on different servers, racks, and/or data centers. So with HDFS, your big data takes up three times as much space as your raw data. But this data expansion helps ensure high availability in a low-cost processing environment built on commodity servers. (A small code sketch of loading a file into HDFS follows this list.)
- Schema on read. Because Hadoop stores data in a file system, you don't have to model and structure the data before loading it, as you would with a relational database. Consequently, the cost of loading data into Hadoop is much lower than the cost of loading data into a relational database. However, if you don't structure the data up front at load time, you have to structure it at query time. This is what "schema on read" means: whoever queries the data has to know its structure to write a coherent query. In practice, this means that only the people who loaded the data know how to query it. This will change once Hadoop gets a reasonable metadata layer, but right now, issuing queries is a buyer-beware environment. (The second sketch after this list shows what schema on read looks like in code.)
- MapReduce. Hadoop is a parallel processing environment, like most high-end, SQL-based analytical platforms. Hadoop spreads data across all its nodes, each of which has direct-attached storage. But writing parallel applications is complex. MapReduce is an API that shields developers from the intricacies of writing parallel applications on a distributed file system. It takes care of all the underlying inter-node communication, error handling, and so on. All developers need to know is which elements of their application can be parallelized and which can't. (The word-count sketch after this list shows the shape of the API.)
- Open source. Hadoop is free; you can download it from the Apache Software Foundation and start building with it. For a big data platform, this is a radical proposition, especially since most commercial big data software easily carries a six- to seven-figure price tag. Google developed the precursor to Hadoop as a cost-effective way to build its Web search indexes and then made its intellectual property public so others could benefit from its innovations. Google could have used relational databases to build its search indexes, but the costs of doing so would have been astronomical, and it would not have been the most elegant way to process Web data, which is not classically structured.
- Data scientist. You need data scientists to extract value from Hadoop. From what I can tell, data scientists combine the skills of a business analyst, a statistician, a business domain expert, and a Java coder. In other words, they really don't exist. And if you can find one, they are expensive to hire. But the days of the data scientist are numbered; soon, the Hadoop community will deliver higher-level languages and interfaces that make it easier for mere mortals to query the environment. Meanwhile, SQL-based vendors are working feverishly to integrate their products with Hadoop so that users can query Hadoop data using familiar SQL-based tools without having to know how to access or manipulate it directly.
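To make the HDFS point concrete, here is a minimal sketch using Hadoop's Java FileSystem API. The file and path names (sales.log, /raw/sales.log) are hypothetical; the sketch simply copies a raw file into HDFS as-is and then reports how many replicas each of its blocks gets, which on a default cluster is three.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsLoadExample {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath,
        // including the cluster's default replication factor (3 out of the box).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path local = new Path("sales.log");        // hypothetical local file
        Path target = new Path("/raw/sales.log");  // hypothetical HDFS path

        // No modeling or schema required: the file lands in HDFS exactly as-is.
        fs.copyFromLocalFile(local, target);

        // Each block of the file is replicated this many times across
        // different servers and racks for fault tolerance.
        short replication = fs.getFileStatus(target).getReplication();
        System.out.println("Replication factor: " + replication);

        fs.close();
    }
}
```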
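Schema on read in code: the structure lives with the reader, not with the storage. The sketch below assumes a hypothetical tab-delimited web log with a timestamp, user ID, and URL in that order; nothing in HDFS records or enforces that layout, so anyone who queries the file has to already know it.

```java
// A minimal illustration of schema on read: the raw line carries no schema,
// so the reading code imposes one. The field names and their order are
// assumptions about a hypothetical tab-delimited web log.
public class PageView {
    public final String timestamp;
    public final String userId;
    public final String url;

    private PageView(String timestamp, String userId, String url) {
        this.timestamp = timestamp;
        this.userId = userId;
        this.url = url;
    }

    // Whoever queries the data has to know this layout; HDFS does not enforce it.
    public static PageView parse(String rawLine) {
        String[] fields = rawLine.split("\t");
        return new PageView(fields[0], fields[1], fields[2]);
    }
}
```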
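Finally, the shape of the MapReduce API, shown with the canonical word-count example against Hadoop's org.apache.hadoop.mapreduce classes (the driver code that wires these classes into a Job is omitted). The developer writes the two small functions below; the framework takes care of distributing them across nodes, shuffling the intermediate results, and recovering from failures.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// The map step runs in parallel on every block of input, wherever it is stored.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);  // emit (word, 1)
            }
        }
    }
}

// The reduce step receives every count for a given word; the framework has
// already moved the data between nodes (the "shuffle") before this runs.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```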
So, those are the six unique things that Hadoop brings to the market. Probably the most significant is that Hadoop dramatically lowers the cost of loading data into an analytical environment. As a result, organizations can now load all their data into Hadoop without financial or technical penalty. The mantra shifts from "load only what you need" to "load it in case you need it." This makes Hadoop a much more flexible and agile environment, at least on the data loading side of the equation.
Posted February 19, 2013 11:59 AM