Blog: Colin White

Colin White

I like the various blogs associated with my many hobbies and even those to do with work. I find them very useful and I was excited when the Business Intelligence Network invited me to write my very own blog. At last I now have somewhere to park all the various tidbits that I know are useful, but I am not sure what to do with. I am interested in a wide range of information technologies and so you might find my thoughts will bounce around a bit. I hope these thoughts will provoke some interesting discussions.

About the author

Colin White is the founder of BI Research and president of DataBase Associates Inc. As an analyst, educator and writer, he is well known for his in-depth knowledge of data management, information integration, and business intelligence technologies and how they can be used for building the smart and agile business. With many years of IT experience, he has consulted for dozens of companies throughout the world and is a frequent speaker at leading IT events. Colin has written numerous articles and papers on deploying new and evolving information technologies for business benefit and is a regular contributor to several leading print- and web-based industry journals. For ten years he was the conference chair of the Shared Insights Portals, Content Management, and Collaboration conference. He was also the conference director of the DB/EXPO trade show and conference.

Editor's Note: More articles and resources are available in Colin's BeyeNETWORK Expert Channel. Be sure to visit today!

Last week I presented at the Big Data Summit and attended Hadoop World in New York. Both events focused on the use of Hadoop and MapReduce for the processing and analyzing of very large amounts of data.

The Big Data Summit was organized by Aster Data and sponsored by Informatica and MicroStrategy. Given that the summit was in the same hotel as that used for Hadoop World the following day, it would be reasonable to expect that most of the attendees would be attending both events. This was not entirely the case. Many of the summit attendees came from enterprise IT backgrounds and these folks were clearly interested in the role of Hadoop in enterprise systems. While many of them were knowledgeable about Hadoop, an equal number were not.

The message coming out of the event was that Hadoop is a powerful tool for the batch processing of huge quantities of data, but coexistence with existing enterprise systems is fundamental to success. This is why Aster Data decided to use the event to launch their Hadoop Data Connector, which uses Aster's SQL-MapReduce (SQL-MR) capabilities to support the bi-directional exchange of data between Aster's analytical database system and the Hadoop Distributed File System (HDFS). One important use of Hadoop is to preprocess, filter, and transform vast quantities of semi-structured and unstructured data for loading into a data warehouse. This can be thought of as Hadoop ETL. Good load performance in this environment is critical.
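To make the Hadoop ETL idea a little more concrete, here is a minimal sketch in plain Python that mimics the map and reduce steps in a single process. It is not Hadoop code and not anything shown at the summit: the tab-separated log layout (user, URL, bytes) and the function names `map_record`, `reduce_records`, and `run_job` are assumptions made up for illustration. The map step filters out malformed records and normalizes the key, and the reduce step produces one clean row per user that could then be bulk loaded into a warehouse table.

```python
from collections import defaultdict


def map_record(line):
    """Map step: parse one raw log line and emit (user, bytes) pairs.

    Assumed (hypothetical) layout: user<TAB>url<TAB>bytes.
    Malformed lines are dropped, which is the "filter" part of the ETL.
    """
    parts = line.strip().split("\t")
    if len(parts) != 3:
        return  # wrong number of fields: skip the record
    user, _url, nbytes = parts
    if not nbytes.isdigit():
        return  # non-numeric byte count: skip the record
    yield (user.lower(), int(nbytes))  # the "transform": normalize the key


def reduce_records(key, values):
    """Reduce step: collapse all values for a key into one warehouse-ready row."""
    values = list(values)
    return (key, len(values), sum(values))


def run_job(lines):
    """Stand-in for the shuffle phase: group mapped pairs by key, then reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_record(line):
            groups[key].append(value)
    return [reduce_records(key, vals) for key, vals in sorted(groups.items())]


raw = [
    "Alice\t/home\t120",
    "bob\t/cart\t300",
    "garbage line",
    "alice\t/cart\t80",
    "bob\t/home\tn/a",
]
rows = run_job(raw)
for row in rows:
    print(row)  # (user, request_count, total_bytes)
```

In a real cluster the grouping in `run_job` is what the framework does across many nodes, which is why the speed of moving the reduced output into the warehouse matters so much.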

Case studies from Comscore and LinkedIn demonstrated the power of MapReduce in processing petabytes of data. Comscore is aiming to manage and analyze 3 months of detailed records (160 billion records) using Aster SQL-MR. LinkedIn, on the other hand, is using a combination of Hadoop and Aster's MapReduce capabilities and moving data between the two environments. Performance and parallel processing are important for efficiently managing this exchange of data. This latter message was repeated by several other case studies at both events.

Hadoop World had a much more open source and developer feel to it. It was organized by Cloudera and had about 500 attendees. About half the audience was using Amazon Web Services and was clearly experienced with Hadoop. Sponsors included Amazon Web Services, IBM, Facebook and Yahoo, all of whom gave keynotes. These keynotes were great for big numbers. Yahoo, for example, has 25,000 nodes running Hadoop (the biggest cluster has 4,000 nodes). Floor space and power consumption become major issues when deploying this level of commodity hardware. Yahoo processes 490 terabytes of data to construct its web index. This index takes 73 hours to build and has experienced a 50% growth in a year. This highlights the issues facing many web-based companies today, and potentially other organizations in the future.

Although the event was clearly designed to evangelize the benefits of Hadoop, all of the keynotes emphasized interoperability with, rather than replacement of, existing systems. Two relational DBMS connectors were presented at the event: Sqoop from Cloudera and support for the Cloudera DBInputFormat interface from Vertica. Cloudera also took the opportunity to announce that it was evolving from a Hadoop services company into a developer of Hadoop software.

The track sessions were grassroots Hadoop-related presentations. There was a strong focus on improving the usability of Hadoop and adding database and SQL query features to the system. On several occasions I felt that many people were trying to reinvent the wheel by solving problems that had already been solved by both open source and commercial database products. There is a clear danger in trying to expand Hadoop and MapReduce from being an excellent system for the batch processing of vast quantities of information to being a more generalized DBMS.

The only real attack on existing database systems came, surprisingly, from the J. P. Morgan financial services company. The presentation started off by denigrating current systems and presenting Hadoop as an open source solution that solved everyone's problems at a much lower cost. When it came to use cases, however, the speakers positioned Hadoop as suitable for processing large amounts of unstructured data with high data latency. They also listed a number of "must have" features for the use of Hadoop in traditional enterprise situations: improved SQL interfaces, enhanced security, support for a relational container, reduced data latency, better management and monitoring tools, and an easier-to-use developer programming model. Sounds like a relational DBMS to me. Somehow the rhetoric at the beginning of the session didn't match the more practical perspectives of the latter part of the presentation.

In summary, it is clear that Hadoop and MapReduce have an important role to play in data warehousing and analytical processing. They will not replace existing environments, but will interoperate with them when traditional systems are incapable of processing big data and when certain sectors of an organization use Hadoop to mine and explore the vast data mountain that exists both inside and outside of organizations. This makes the current trend toward hybrid RDBMS SQL and MR solutions from companies such as Aster Data, Greenplum and Vertica an interesting proposition. It is important to point out, however, that each of these vendors takes a different approach to providing this hybrid support and it is essential that potential users match the hybrid solution to application requirements and developer skills. It is also important to note that Hadoop is more than simply MapReduce.   

If you want to get up to speed on all things Hadoop, read some case studies, and gain an understanding of its pros and cons versus existing systems then get Tom White's (I am not related!) excellent new book "Hadoop: The Definitive Guide" published by O'Reilly.


Posted October 6, 2009 1:37 PM
I commented in my previous blog entry that the controversy over ParAccel's TPC-H benchmark has become quite heated. This is especially true on Curt Monash's blog where at one point he made some personal comments about Kim Stanick, ParAccel's VP of Marketing. See this link for details.

This is the second blog this month that I have read where an analyst makes an attack, not only on the vendor, but also on one of its employees. The other blog (and an associated article) was by Stephen Few entitled "Business is Personal - Let's Stop Pretending It Isn't." See this link for details.

The good thing about social computing is that it provides a fast way of sharing and collaborating about industry developments. However, these technologies have the same problem as e-mail and instant messaging: they enable people to react immediately to something that upsets or annoys them. With blogging, unlike e-mail and instant messaging, everyone gets to see the results!

As analysts, our job is to write balanced reviews of industry developments that provide useful information to the reader. My concern is that some analysts are behaving as though they are on cable television or writing for the tabloids. I believe we can critique a product without attacking a company, its products or its employees. Personal attacks by analysts are unprofessional, even if the company fights back against a review it takes exception to. What do you think?

Posted June 25, 2009 1:52 PM
ParAccel, one of the new analytic DBMS vendors, recently announced some impressive TPC-H benchmark results. A good review of these results can be found on Merv Adrian's blog at this link.

Not everyone agreed with Merv's balanced review. Curt Monash commented that "The TPC-H benchmark is a blight upon the industry." See his blog entry at this link.

This blog entry resulted in some 41 (somewhat heated) responses. At one point Curt made some negative comments about ParAccel's VP of Marketing, Kim Stanick, which in turn led to accusations that his blog entry was influenced by personal feelings.

I have two comments to make about this controversy. The first concerns the TPC-H benchmark and the second is about an increasing lack of social networking etiquette by analysts.

TPC benchmarks have always been controversial. People often argue that they do not represent real-life workloads. What this really means is that your mileage may vary. These benchmarks are expensive to run and vendors throw every piece of technology at the benchmark in order to get good results. Some vendors are rumored to have even added special features to their products to improve the results. The upside of the benchmarks is that they are audited and reasonably well documented.

The use of TPC benchmarks has slowed over recent years. This is not only because they are expensive to run, but also because they have less marketing impact than in the past. In general, they have been of more use to hardware vendors because they demonstrate hardware scalability and provide hardware price/performance numbers. Oracle was perhaps an exception here because they liked to run full-page advertisements saying they were the fastest database system in existence.
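For readers who have not looked at how a price/performance number is put together, here is a rough sketch for TPC-H. As I understand the specification, the composite queries-per-hour figure (QphH) is the geometric mean of the power and throughput results, and price/performance is the total system price divided by that figure. The function names and all of the input numbers below are invented for illustration and do not come from any published result.

```python
import math


def composite_qphh(power, throughput):
    """Composite metric: geometric mean of the power and throughput results."""
    return math.sqrt(power * throughput)


def price_per_qphh(total_price, power, throughput):
    """Price/performance: total system price divided by the composite metric."""
    return total_price / composite_qphh(power, throughput)


# Hypothetical inputs, chosen only to keep the arithmetic easy to follow.
qphh = composite_qphh(power=40_000.0, throughput=10_000.0)
dollars_per_qphh = price_per_qphh(500_000.0, power=40_000.0, throughput=10_000.0)
print(f"QphH: {qphh:.0f}, price/performance: ${dollars_per_qphh:.2f} per QphH")
```

Because the hardware dominates the priced configuration, it is easy to see why these figures have tended to be of more interest to hardware vendors.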

TPC benchmarks do have some value to both the vendor and the customer. The benefits to the vendor are increased visibility and credibility. Merv Adrian described this as a "rite of passage." It helps the vendor get on the short list. For the customer these benchmarks show the solution to be credible and scalable. All products work well in PowerPoint, but the TPC benchmarks demonstrate that the solution is more than just vaporware.

I think most customers are knowledgeable enough to realize that the benchmark may not match their own workloads or scale as well in their own environments. This is where the proof of concept (POC) benchmark comes in. The POC enables the customer to evaluate the product using their own workloads.

TPC benchmarks are not perfect, but they do provide some helpful information in the decision-making process.

I will address the issue of blog etiquette in a separate blog entry.  



Posted June 25, 2009 1:43 PM
My post a couple of days ago about data warehousing in the cloud led to requests for more information about this topic and related SaaS BI solutions.  

Claudia Imhoff and I recently published a research report on the BeyeNETWORK entitled "Pay as You Go: Software-as-a-Service Business Intelligence and Data Management." The report was sponsored by Blinklogic, Host Analytics, PivotLink and SAP BusinessObjects, which offer SaaS BI solutions. It was also sponsored by Kognitio, which (like Aster, Greenplum and Vertica mentioned in my previous blog) has a data warehousing in the cloud offering. The report discusses SaaS BI and data warehousing and reviews the pros and cons of using this type of deployment model.

The report can be found on beyeresearch.com.

Posted June 11, 2009 4:58 PM
The use of cloud computing for data warehousing is getting a lot of attention from vendors. Following hot on the heels of Vertica's Analytic Database v3.0 for the Cloud announcement on June 1 was yesterday's Greenplum announcement of its Enterprise Data Cloud™ platform and today's announcement by Aster of .NET MapReduce support for its nCluster Cloud Edition.

I have interviewed all three vendors over the past week and while there are some common characteristics in the approaches being taken by the three vendors to cloud computing, there are also some differences.

Common characteristics include:
  • Software only analytic DBMS solutions running on commodity hardware
  • Massively parallel processing
  • Focus on elastic scaling, high availability through software, and easy administration
  • Acceptance of alternative database models such as MapReduce
  • Very large databases supporting near-real-time user-facing applications, scientific applications, and new types of business solutions

The emphasis of Greenplum is on a platform that enables organizations to create and manage data warehouses and data marts using a common pool of physical, virtual or public cloud infrastructure resources. The concept here is that multiple data warehouses and data marts are a fact of life and the best approach is to put these multiple data stores onto a common and flexible analytical processing platform that provides easy administration and fast deployment using good enough data. Greenplum sees this approach being used initially on private clouds, with the use of public clouds growing over time.

Aster's emphasis is on extending analytical processing to the large audience of Java, C++ and C# programmers who don't know SQL. They see these developers creating custom analytical MapReduce functions for use by BI developers and analysts who can use these functions in SQL statements without any programming involved.

Although MapReduce has typically been used by Java programmers, there is also a large audience of Microsoft .NET developers who potentially could use MapReduce. A recent report by Forrester, for example, shows 64% of organizations use Java and 43% use C#. The objective of Aster is to extend the use of MapReduce from web-centric organizations into large enterprises by improving its programming, availability and administration capabilities over and above open source MapReduce solutions such as Hadoop.

Vertica sees its data warehouse cloud computing environment being used for proof of concept projects, spillover capacity for enterprise projects and for software-as-a-service (SaaS) applications. Like Greenplum it supports virtualization. Its Analytic Database v3.0 for the Cloud adds support for more cloud platforms including Amazon Machine Images and early support for the Sun Compute Cloud. It also adds several cloud-friendly administration features based on open source solutions such as Webmin and Ganglia.

It is important for organizations to understand where cloud computing and new approaches such as MapReduce fit into the enterprise data warehousing environment. Over the course of the next few months my monthly newsletter on the BeyeNETWORK will look at these topics in more detail and review the pros and cons of these new approaches.


Posted June 9, 2009 12:00 AM