Important Things You Need to Know About Hadoop
Originally published September 17, 2013
Increasing interest in using Hadoop for data management, transformation and analysis has led to significant development efforts by commercial vendors to enhance and extend the open source Apache Hadoop framework and offer a range of different Hadoop platforms. Although these platforms vary by vendor, some common characteristics and trends apply to all of them, and potential users need to be aware of these.
- Core components based on open source software. All platforms include Apache Hadoop core services, HDFS and MapReduce (MR), plus Apache Hive and Apache Pig. Many also include the HBase DBMS.
- Massively parallel processing on low-cost hardware. A key benefit of Hadoop is its ability to support massively parallel processing (MPP) on low-cost hardware. Several fault-tolerant features are built into the software to compensate for the likely higher failure rates of this hardware. Hadoop can also be deployed on a desktop computer or SMP server for evaluation and development purposes.
The nodes of a Hadoop MPP cluster generally use standalone or rack servers with direct-attached storage (DAS). Other hardware configurations are possible, including the use of blade servers and shared storage architectures.
- Multiple deployment platforms. Hadoop products run on Linux, but support for Microsoft Windows is available. Some vendors offer optimized versions of Hadoop for running as a virtual machine. The ability to operate Hadoop in a cloud-computing environment is a growing trend. There is also increasing interest in deploying Hadoop in the open source OpenStack cloud operating environment.
- Multiple application interfaces. The direction of Hadoop vendors is to provide three types of application interfaces to Hadoop data: procedural programming language (Java, other MR-supported languages, Pig, R), declarative query language (Hive and other SQL layers) and search (Apache Lucene/Solr and other proprietary search engines).
One area that is receiving significant attention and development effort is the addition of an SQL layer on top of the Hadoop environment. This topic is discussed in more detail below.
- Data interchange with other enterprise systems. For Hadoop to be successful in mainstream enterprises, it must be able to exchange data with existing IT systems. Many vendors offer their own data transfer solutions. These solutions, however, vary significantly both in function and in performance.
Several relational DBMS vendors (IBM, Oracle, Teradata, HP Vertica, and Actian ParAccel) provide bulk data transfer features that exploit the parallel computing capabilities of both source and target systems to improve data transfer performance. These transfer operations are often done using Hadoop MR programs that are invoked using SQL function calls in relational DBMS applications and scripts.
Data integration vendors (Informatica, Actian Pervasive, Pentaho, Syncsort, Talend and SAS) extend bulk data transfer with the ability to transform the data before it is transferred. This transformation is frequently achieved using MR programs.
- Evolving data management tools. Hadoop to date has offered little in the way of the data management tools that are a key feature of most enterprise systems. Some initial steps are now being taken to fill this gap. One of these is the Apache HCatalog facility, which provides a common interface to Hive, Pig and HDFS metadata. Several vendors are also beginning to offer their own tools for data management, and these tools will become increasingly important as the use of Hadoop continues to grow.
- Removal of Hadoop single points of failure (SPOF) and performance bottlenecks. There are several Hadoop SPOFs (Hadoop NameNode and JobTracker service) and performance limitations (scheduling and workload management, query performance of Hive and HDFS), and both Apache and Hadoop vendors are working to overcome these issues.
- Evolving system management tools. Many companies find Hadoop difficult to install, administer, tune and maintain. This is not only due to a general lack of tooling, but also because these companies don't have the required skills. Vendor Hadoop data management platforms help simplify matters, especially in the area of installation, and most Hadoop vendors also offer consulting and support services. There is, however, still a lack of enterprise-quality system management tools for Hadoop, and several open source and vendor projects are in the works to help solve this. An example of such a project is Apache Ambari.
- Increasing number of partner relationships. Hadoop has become closely associated with the present industry focus on the concept of big data. The key to realizing a return on investments in big data technologies is leveraging this data for business advantage. Good development tools coupled with easy-to-use business tools and packaged applications will play an important role here. This is why many Hadoop vendors are partnering with tools developers and third-party vendors to build out a repertoire of IT and business user capabilities.
SQL on Hadoop
The addition of an SQL layer on Hadoop offers several benefits. One of the key ones is that it extends Hadoop's programmatic MR model to support SQL-based tools, which in turn enables non-programmers, such as business analysts, to access Hadoop data. Depending on how the SQL layer is implemented, it can also extend Hadoop's batch-oriented workload model to enable a more ad hoc style of processing. How fully these benefits can be realized likewise depends on the implementation.
One of the first SQL layers developed for Hadoop was Hive, which presents data to developers in the form of tables. The data in these tables is accessed and manipulated using HiveQL query statements. Hive does not have a sophisticated query optimizer, but instead uses a query compiler and a set of rules to convert HiveQL queries into a series of MR batch jobs for execution on a Hadoop cluster.
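To make the translation from a HiveQL query into an MR job more concrete, the sketch below imitates in plain Python the map, shuffle and reduce steps that a simple aggregation query would go through. It is an illustration only: the `page_views` table, its columns and the query are hypothetical, and real Hive generates jobs that run in parallel across a Hadoop cluster rather than in a single process.

```python
from collections import defaultdict

# Hypothetical rows of a "page_views" table: (user_id, country).
PAGE_VIEWS = [
    ("u1", "US"),
    ("u2", "DE"),
    ("u3", "US"),
    ("u4", "FR"),
    ("u5", "DE"),
    ("u6", "US"),
]

# The query being imitated (illustrative only):
#   SELECT country, COUNT(*) FROM page_views GROUP BY country;


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for _user_id, country in rows:
        yield country, 1


def shuffle_phase(pairs):
    """Group the emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, which implements COUNT(*)."""
    return {key: sum(values) for key, values in grouped.items()}


def run_query(rows):
    """Chain the three phases the way a single-stage MR job would."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    for country, count in sorted(run_query(PAGE_VIEWS).items()):
        print(country, count)
```

A query with joins or several aggregation steps would typically need more than one such stage, which is why the text describes the result as a series of MR batch jobs.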
To overcome the limitations of Hive and to enable faster query performance, several vendors are building enhanced SQL layers on top of Hadoop. These SQL layers employ a variety of techniques. Some of the main ones and the vendors that support them are outlined below:
- Improve the functionality and performance of Hive (Hortonworks and Intel).
- Add an SQL layer that bypasses Hive and MapReduce and accesses the Hadoop data directly. The downside of this approach is that the developer loses the power of MR processing. For this reason, this approach complements Hive and MapReduce rather than replacing them (Apache and Cloudera).
- Develop new on-disk and/or in-memory Hadoop data handlers and data formats that are more suited to ad hoc query processing (Apache, Cloudera, Hortonworks, IBM, Intel and Pivotal).
- Build a new SQL query engine running on Hadoop that uses a query splitter to route SQL query fragments to one or more underlying data handlers to access and process the data (Hadapt and IBM).
It has taken RDBMS vendors many years of research and development to build high-performance SQL optimizers and data handlers. It is unreasonable, therefore, to expect the relatively immature Hadoop SQL layers to provide the same level of functionality and performance, especially where complex SQL workloads are required.
Things to Consider When Selecting a Hadoop Platform
Before purchasing a Hadoop platform, there are several things you should consider.
- Identify the types of data, applications and workloads you wish to deploy on Hadoop. The first task is to identify the sources and types of data required to satisfy business requirements, and to ascertain the processing to be performed on that data. This latter information enables project managers to evaluate which Hadoop platforms are capable of providing the required functions and performance, and to determine the hardware and software needed to support the project.
- When determining the cost of the Hadoop data management platform, calculate the total cost of ownership for the system. Many customers, when comparing or purchasing a Hadoop system, simply calculate the cost per terabyte of the system, or the purchase price for the hardware and software. There are many additional costs involved and it is important that all of these are taken into account.
- Understand the education and skills requirements for implementing Hadoop. Many organizations don't have all of the skills or experience required to build and deploy a Hadoop environment. It is important to identify gaps in skill sets that need to be filled before the project can proceed. In general, services from Hadoop vendors and/or third-party consulting companies can be used to fill those gaps.
- Investigate whether other parts of the organization are using Hadoop. Hands-on Hadoop experience takes time to acquire, and project managers should find out whether other parts of the organization are already using Hadoop and can act as a source of knowledge and best practices.
- Talk to customers who have deployed Hadoop solutions. Organizations should also look for other companies in the same market sector that are using Hadoop to see how they are deploying it and to share information about best practices.
- Be careful in your choice of Hadoop data management platform and understand its impact on the existing IT infrastructure. There are many different components that are potentially involved in a Hadoop environment. The actual components used will depend on project requirements, performance needs, and on existing IT deployment strategies and standards. Careful platform selection is required, not only to ensure that the platform selected satisfies requirements, but also that it can easily be integrated into the existing IT, business intelligence and data warehousing infrastructure.
- Be realistic, but also pragmatic, about the value and use of Hadoop. One of the main difficulties in selecting a Hadoop solution is separating reality from hype. While there is no doubt that Hadoop is a valuable addition to the technology toolbox, it is not a panacea. It must be recognized that Hadoop is still immature, even though it is changing and evolving rapidly. It is important that organizations carefully select a platform that supports their needs, avoids vendor lock-in where possible, and also provides a flexible architecture that can evolve with changes in the Hadoop marketplace.
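The difference between purchase price per terabyte and total cost of ownership, raised in the list above, can be illustrated with a small calculation. The sketch below is only one possible way to structure it: the cost categories are assumptions about what "additional costs" might include, and every figure is a hypothetical placeholder rather than real pricing.

```python
from dataclasses import dataclass


@dataclass
class PlatformCosts:
    """Hypothetical cost inputs, all in the same currency unit."""
    hardware: float            # one-time purchase
    software_licenses: float   # one-time purchase
    one_time_training: float   # education and consulting to fill skill gaps
    annual_support: float      # vendor support and subscriptions, per year
    annual_staff: float        # administration and development, per year
    annual_operations: float   # power, cooling and floor space, per year
    usable_terabytes: float


def purchase_price_per_tb(costs):
    """The narrow metric: hardware and software price per terabyte."""
    return (costs.hardware + costs.software_licenses) / costs.usable_terabytes


def total_cost_of_ownership(costs, years):
    """One-time costs plus recurring costs over the planning horizon."""
    one_time = costs.hardware + costs.software_licenses + costs.one_time_training
    recurring = (
        costs.annual_support + costs.annual_staff + costs.annual_operations
    ) * years
    return one_time + recurring


def tco_per_tb(costs, years):
    """Total cost of ownership expressed per terabyte."""
    return total_cost_of_ownership(costs, years) / costs.usable_terabytes


# Placeholder figures chosen only to make the arithmetic easy to follow.
EXAMPLE = PlatformCosts(
    hardware=200_000.0,
    software_licenses=50_000.0,
    one_time_training=30_000.0,
    annual_support=40_000.0,
    annual_staff=150_000.0,
    annual_operations=20_000.0,
    usable_terabytes=100.0,
)

if __name__ == "__main__":
    print("Purchase price per TB:", purchase_price_per_tb(EXAMPLE))
    print("3-year TCO:", total_cost_of_ownership(EXAMPLE, years=3))
    print("3-year TCO per TB:", tco_per_tb(EXAMPLE, years=3))
```

With these placeholder figures, the purchase price works out to 2,500 per terabyte while the three-year total cost of ownership is 9,100 per terabyte, which shows how far the two measures can diverge once recurring costs are included.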
This article is based on an in-depth research report and e-book by Colin White entitled Hadoop Data Management Platforms: Market Segmentation and Product Positioning.
The report provides an introduction to Hadoop, reviews leading Hadoop solutions from a variety of vendors and suggests criteria for selecting a Hadoop data management platform.