Originally published May 15, 2012
Hadoop is one of the up-and-coming technologies for performing analysis on big data, but to date very few universities have included it in their undergraduate or graduate curricula. In a February 2012 article from InfoWorld, those already using the technology warned that “Hadoop requires extensive training along with analytics expertise not seen in many IT shops today.” A ComputerWorld article singled out MIT and UC Berkeley as having already added Hadoop training and experience to their curricula. Other educational institutions need to seek out practitioners in their area, or poll alumni, to find individuals who can impart this knowledge to college students. If such individuals are available, these institutions should prepare a curriculum to start training the next generation of IT employees and imbue them with the skills they will need to meet the challenges of the 21st century.
Hadoop is one of the newest technology solutions for performing analysis and deriving business intelligence on big data. On the TechTarget website, it is defined as “… a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation.”
Hadoop is a combination of many tools and software products, the two primary ones being HDFS (Hadoop Distributed File System) and MapReduce. In its current form, these components run primarily on the Linux operating system. Both components are Free Open Source Software (FOSS) and are licensed under the Apache License, Version 2.0.
HDFS is a file system that distributes the data to be analyzed across all the servers available in a server farm, which are typically inexpensive commodity machines with internal or direct-attached storage. The data is replicated across several nodes so the failure of any one node does not disrupt the currently executing process. HDFS also maintains a master catalog of file metadata so it always knows on which nodes the specific chunks, or blocks, of the data reside. This mechanism of distributed, replicated storage is what provides support for very large datasets.
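The replication scheme described above can be sketched in plain Java. This is not the actual HDFS implementation — the node names, block counts, and round-robin placement here are invented for illustration — but it shows the essential property: each block lives on several distinct nodes, so losing any one node leaves every block readable elsewhere.

```java
import java.util.*;

public class ReplicationSketch {
    // Assign each block to `replicas` distinct nodes.
    // Round-robin placement is a simplification; real HDFS placement
    // also considers rack topology and free space.
    static Map<Integer, List<String>> placeBlocks(int blocks, List<String> nodes, int replicas) {
        Map<Integer, List<String>> placement = new LinkedHashMap<>();
        for (int b = 0; b < blocks; b++) {
            List<String> holders = new ArrayList<>();
            for (int r = 0; r < replicas; r++) {
                holders.add(nodes.get((b + r) % nodes.size()));
            }
            placement.put(b, holders);
        }
        return placement;
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("node1", "node2", "node3", "node4");
        Map<Integer, List<String>> placement = placeBlocks(6, nodes, 3);
        for (Map.Entry<Integer, List<String>> e : placement.entrySet()) {
            // Count replicas that would survive a failure of node1.
            long survivors = e.getValue().stream()
                    .filter(n -> !n.equals("node1")).count();
            System.out.println("block " + e.getKey() + " -> " + e.getValue()
                    + " (replicas surviving node1 failure: " + survivors + ")");
        }
    }
}
```

With a replication factor of three, every block keeps at least two live copies after any single-node failure, which is why a job in progress can continue uninterrupted.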
As defined on the Apache Hadoop website, “… MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.” MapReduce jobs, typically written in Java, specify how the input data is broken into chunks and how those chunks are processed in parallel on multiple nodes of the cluster. The output of these map tasks is then used as input to the reduce tasks. The data and processing usually reside on the same nodes of the cluster, which provides the scalability to handle the very large datasets typically processed with MapReduce. The basis of this processing is the mapping of the input data into key-value pairs. This is very similar to XML, where each combination of start and end tags contains a specific value within a group of elements (e.g., <FirstName>Alex</FirstName>). The reduce tasks combine the outputs of the map tasks into smaller sets of values that can then be used in additional analysis tasks.
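The key-value flow described above can be illustrated with the classic word-count example in plain Java. This sketch deliberately avoids the Hadoop MapReduce API itself — there is no cluster, no parallelism, and the method names are my own — but the two phases mirror what the framework does: the map phase emits a (word, 1) pair for every word, and the reduce phase combines all pairs sharing a key into a single count.

```java
import java.util.*;

public class MapReduceSketch {
    // "Map" phase: emit a (word, 1) key-value pair for each word in each chunk.
    static List<Map.Entry<String, Integer>> map(List<String> chunks) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String chunk : chunks) {
            for (String word : chunk.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
                }
            }
        }
        return pairs;
    }

    // "Reduce" phase: combine all values sharing the same key into one count.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Each string stands in for one chunk of the input data.
        List<String> chunks = Arrays.asList("big data big clusters",
                                            "data data everywhere");
        System.out.println(reduce(map(chunks)));
        // {big=2, clusters=1, data=3, everywhere=1}
    }
}
```

In a real cluster, the map calls would run in parallel on different nodes, each against its local chunk of the file, and the framework would sort and shuttle the intermediate pairs to the reducers.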
To manage the processing, the Hadoop framework provides a job control mechanism that passes the required data to each of the nodes in the cluster, then starts and monitors the jobs on each of the processing nodes. If a particular node fails, the data and processing are automatically switched to a different node in the cluster, preventing the failure of a process due to a node becoming unavailable.
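The failover behavior described above can be sketched as a simple scheduling loop. This is an invented toy, not Hadoop's actual job-control code — real task reassignment involves heartbeats, speculative execution, and data locality — but it captures the core idea: a task assigned to a failed node is transparently rerouted to a healthy one, so the job still completes.

```java
import java.util.*;

public class FailoverSketch {
    // Assign tasks to nodes round-robin; if a node is down, skip to the
    // next healthy node. Assumes at least one node in the list is healthy.
    static Map<String, String> schedule(List<String> tasks, List<String> nodes,
                                        Set<String> failedNodes) {
        Map<String, String> assignment = new LinkedHashMap<>();
        int next = 0;
        for (String task : tasks) {
            String node = nodes.get(next % nodes.size());
            next++;
            while (failedNodes.contains(node)) {   // node unavailable: reroute
                node = nodes.get(next % nodes.size());
                next++;
            }
            assignment.put(task, node);
        }
        return assignment;
    }

    public static void main(String[] args) {
        List<String> tasks = Arrays.asList("map-1", "map-2", "map-3");
        List<String> nodes = Arrays.asList("node1", "node2", "node3");
        // node2 has failed mid-job; its work lands elsewhere.
        System.out.println(schedule(tasks, nodes, Set.of("node2")));
    }
}
```

Because HDFS keeps replicas of each block on several nodes (as described earlier), the rerouted task can usually still read its input from local or nearby storage.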
According to an October 2010 article in InfoWorld, the initial use of the Hadoop framework was to index web pages, but it is now being viewed as an alternative to other business intelligence (BI) products that rely on data residing in databases (structured data), since it can work with unstructured data from disparate data sources that database-oriented tools are unable to handle as effectively. The article goes on to state that corporations “… are dropping data on the floor because they don't have a place to put it” and that Hadoop clusters can provide a storage and processing mechanism for this data so it is not wasted.
A very recent InfoWorld article examines the issues involved in the detection of cyber criminals by combining big data with traditional structured data residing in a data warehouse. The article mentions that the biggest problem will be distinguishing the network activity and behavior of individuals who are accessing the system for legitimate reasons from those out to steal sensitive information for nefarious purposes. It also mentions the inability of the security information and event management (SIEM) and other intrusion detection systems (IDS) currently used for this purpose to correctly detect and report these types of events. These tools generate mounds of information that cannot be adequately analyzed to separate the good users from the bad users and thereby protect the enterprise and its data.
A November 2011 article in ComputerWorld mentions that JPMorgan Chase is using Hadoop “… to improve fraud detection, IT risk management, and self service applications” and that eBay is using it to “build a new search engine for its auction site.” The article goes on to warn that anyone using Hadoop and its associated technologies needs to consider the security implications of handling data in that environment, because the currently provided security mechanisms of access control lists and Kerberos authentication are inadequate for most high-security applications. It was noted that most government agencies utilizing Hadoop clusters are firewalling them into “… separate ‘enclaves’ … ” to protect the data and ensure that only those with proper security clearance can see it. One of the individuals interviewed for the article suggested that all sensitive data in transit to or stored in a Hadoop cluster be encrypted at the record level. Given all these security concerns, many executives do not view Hadoop as being ready for enterprise consumption.
An article in ComputerWorld states that IT training in these skills can be obtained from organizations such as Cloudera, Hortonworks, IBM, MapR and Informatica. Cloudera has been offering this training for three years and also offers a certification at the end of its four-day training program. According to the education director at Cloudera, their certification is deemed valuable by enterprises, and organizations are starting to require the Cloudera Hadoop certification in their job postings. Hortonworks just started offering training and certification classes in February 2012, while IBM has been doing so since October 2011; the big difference between the two is that Hortonworks is targeting IT professionals with Java and Linux experience, while IBM is targeting undergraduate and graduate students taking online classes in Hadoop. Upon completion of these classes, students are qualified to take a certification test; however, when the article was written, only approximately 1% of students had taken the certification exam.
A recent ComputerWorld article mentions that the terms “Hadoop” and “data scientist” are starting to show up in job postings, and that some of the most well-known organizations are posting these job requirements. The article notes Google's report that search volume for the term “data scientist” was 20 times higher in the first quarter of 2012 – so far – than in the last quarter of 2011, and that there were 195 job listings on Dice.com mentioning the term. This indicates that the market for combined skills in IT and statistics is growing very quickly as businesses realize that this new technology can provide real value to their organizations. They will require a new IT specialty, called “data science,” to analyze the data extracted and processed using Hadoop and to apply statistical analysis that derives beneficial insights from it.
To address the shortage of individuals entering the workforce with the skills necessary to effectively utilize technologies like Hadoop, educational institutions need to offer courses in data analysis and data mining using statistical modeling methods, as well as more specialized courses in Hadoop technologies like HDFS and MapReduce. These courses should place heavy emphasis on setting up Hadoop, HDFS, Java and any other software required for the environment to operate correctly. Since most students will be performing these tasks on a laptop or in a virtual machine (VM) environment, it may be more desirable to provide a preloaded VM to the students so they can see the end state they need to achieve. This VM could also be used for the initial programming courses in Java, so the students are not burdened with setting up the environment until they reach more advanced courses in operating system technologies.