Big Data Integration with Hadoop: A Q&A Spotlight with Yves de Montcheuil of Talend
by Ron Powell
Originally published March 5, 2012
BeyeNETWORK Spotlights focus on news, events and products in the business intelligence ecosystem that are poised to have a significant impact on the industry as a whole; on the enterprises that rely on business intelligence, analytics, performance management, data warehousing and/or data governance products to understand and act on the vital information that can be gleaned from their data; or on the providers of these mission-critical products.
Presented as Q&A-style articles, these interviews conducted by the BeyeNETWORK present the behind-the-scenes view that you won’t read in press releases.
This BeyeNETWORK spotlight features Ron Powell's interview with Yves de Montcheuil, Vice President of Marketing at Talend. Ron and Yves discuss how the Talend Open Studio for Big Data leverages the power of Hadoop for data integration and data quality, and Talend's partnership with Hortonworks.
Yves, you recently announced Talend Open Studio for Big Data, and the world is really energized about big data. Could you tell us if this is Talend's first step into the world of "big data"?
Yves de Montcheuil: It's not our first step into the world of big data. Talend has been working with big data since before it was called big data. We have had connectivity for Hadoop. We have had the ability to load data and to process data within Hadoop for over two years. We've been partners with the early vendors in the Hadoop space – for example, Cloudera. We've been present at many of the Hadoop events in the past few years. So clearly, Talend is not a new entrant into the big data space. What we are doing with Talend Open Studio for Big Data is putting all of our big data capabilities into the same product. That's, of course, data integration, which is obviously a key element, but also other features, such as data quality and cleansing, that apply to big data – the same kind of rules and filters that you would apply to conventional data.
When I look at the big data world, we hear a lot about Hadoop. What benefits does Talend Open Studio for Big Data bring to the users of Hadoop?
Yves de Montcheuil: The goal of Talend Open Studio for Big Data is to democratize the deployment of Hadoop to leverage big data. Talend was originally founded on the promise of democratizing integration, and we've been extremely successful at that, especially when it comes to integrating databases, applications, cloud, SaaS, etc. Hadoop introduces very high complexity to what you need to design in order to extract value – to extract information – out of that massive amount of data. Typically it would take something akin to a PhD in MapReduce, and I don't think that PhD has been invented yet. What we are offering with Talend Open Studio for Big Data is the ability to very easily design big data integration and big data quality jobs, connect to sources, connect to targets, get data into Hadoop, and process data in Hadoop. In other words, not only integrate it with the rest of the enterprise IT stack – you might want to get data out of Oracle or Salesforce.com, and get the resulting data into Teradata or into QlikView – but also prepare the data and process it directly inside Hadoop. Don't use Hadoop only as a place to store information, but also use it for what it is: an engine, an extremely powerful and scalable engine to process information. Again, without having to write the MapReduce code, we can abstract those transformations through our graphical interface with simple drag and drop of components, and the underlying code is generated automatically.
Talend Open Studio for Big Data is fully integrated with the Apache Hadoop stack. It’s available under an Apache license, which makes it compatible at the license level with the Hadoop products.
We also announced recently a partnership with Hortonworks, one of the leading providers of Hadoop distributions – the Hortonworks Data Platform. Talend Open Studio for Big Data is now embedded into the Hortonworks Data Platform, and is clearly the reference tool for integrating, for moving and for transforming big data into Hadoop.
Talend’s roots are open source, and big data’s roots are in open source too. It would seem to me that you would have an edge over the competition that does not have your open source roots. Would you agree with that?
Yves de Montcheuil: It’s big data without the big bucks. You want to be able to get the benefits of big data without having to put millions of dollars on the table. And, frankly, a lot of companies have been doing big data for quite some time, but they have been doing it with conventional technologies. You know you can process massive amounts of data with, for example, a Teradata data warehouse, which is an extremely powerful technology, but also an expensive solution. Hadoop changes the game. It brings big data to the masses, and that’s thanks to the open source nature of Hadoop.
Beyond big data integration, is there a requirement for big data quality?
Yves de Montcheuil: There is absolutely a requirement for big data quality. If you are just processing and moving big data without introducing the quality dimension into it, you’re just shoveling heaps of garbage around. So what you want to do is cleanse and enrich the data the same way you would do it for small data or conventional data. I think today anybody who does business intelligence or data warehousing clearly understands the requirement of ensuring the quality of the data. The same holds true for big data except that it’s to the power of ten – at least! – because the data sets are much larger and more complex. If you don’t apply proper data quality, proper data hygiene, to your big data, you’re going to end up with a much bigger problem than what you would encounter in the conventional data world.
In order to do big data quality, one avenue that we are taking is to leverage Hadoop for CPU-intensive data quality functions. Features such as matching, deduplication, and linking of records can consume enormous amounts of resources. Because MapReduce is such a scalable architecture, we have taken the approach of generating Hadoop code in order to perform the data quality functions right inside of Hadoop.
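As a rough editorial illustration of why matching and deduplication fit the MapReduce model, the sketch below simulates the map, shuffle and reduce phases in plain Python. It is not Talend's generated code and it uses no Hadoop API; the record fields (`name`, `email`, `phone`), the match key and the "most complete record wins" survivor rule are all assumptions made for the example.

```python
from collections import defaultdict


def match_key(record):
    """Hypothetical match key: normalized name plus normalized email."""
    name = " ".join(record["name"].lower().split())
    email = record["email"].strip().lower()
    return (name, email)


def map_phase(records):
    """Map step: emit (match key, record) for every input record."""
    for record in records:
        yield match_key(record), record


def shuffle(pairs):
    """Shuffle step: group all records that share a match key.

    On a real cluster the framework does this across machines.
    """
    groups = defaultdict(list)
    for key, record in pairs:
        groups[key].append(record)
    return groups


def reduce_phase(groups):
    """Reduce step: keep one survivor per group of candidate duplicates.

    Assumed rule: the record with the most non-empty fields wins.
    """
    for group in groups.values():
        yield max(group, key=lambda r: sum(1 for v in r.values() if v))


def deduplicate(records):
    """Run the three phases in sequence on an in-memory list."""
    return list(reduce_phase(shuffle(map_phase(records))))


if __name__ == "__main__":
    customers = [
        {"name": "Ann  Lee", "email": "ANN@example.com", "phone": ""},
        {"name": "ann lee", "email": "ann@example.com ", "phone": "555-0100"},
        {"name": "Bob Roy", "email": "bob@example.com", "phone": ""},
    ]
    for survivor in deduplicate(customers):
        print(survivor)
```

Because each group of candidate duplicates is reduced independently of the others, the expensive comparison work can be spread across as many nodes as there are groups, which is the scalability property the answer above relies on.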
That’s an ingenious way to do it. They always say “Do it at the source,” and you can’t do it any closer to the source than by doing it in Hadoop.
Yves de Montcheuil: It’s doing it wherever it makes the most sense. It can be the source, it can be close to the target or it can be an intermediate engine. The key is to process the data where you have the ability to process it the best.
Good point. Can you share with us some use cases for big data integration?
Yves de Montcheuil: Some of our customers are doing very interesting things with big data. One that comes to mind is a telecommunications company. They’ve been doing traditional data warehousing and business intelligence for a long time. In addition to those conventional sources that are coming from their customer applications, from their billing systems, etc., they are now doing sentiment analysis and monitoring social media – Twitter, Facebook – for people who are talking about them and their services. They are aggregating this information using Hadoop technology alongside a data warehouse that resides in one of the traditional data warehousing platforms. It’s really a very interesting use case where big data actually complements the traditional data warehouse that is in place.
Another use case we are encountering is when Hadoop is used essentially as an auxiliary ETL engine, but one that is on steroids. You have this big scale-out architecture that’s your Hadoop cluster, which gives you the ability to process very large amounts of data, aggregate it, and perform mathematical or statistical calculations on those records. By using Hadoop as the ETL engine that will then load the traditional data warehouse, some of our customers are actually decreasing, by 2 or 3 orders of magnitude, the amount of time it takes to process the raw data and load the data warehouse. They are able to get much closer to real-time data warehousing than they were before.
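To make the "auxiliary ETL engine" pattern concrete, here is a minimal sketch in plain Python of the kind of pre-aggregation that would run on the cluster before the warehouse load. It is an illustration only: the raw line layout (`customer_id,date,amount`) and the function names are hypothetical, and a real job would run as distributed map and reduce tasks rather than a single loop.

```python
from collections import defaultdict


def parse_event(line):
    """Map step: turn one raw line 'customer_id,date,amount' into (key, amount)."""
    customer_id, date, amount = line.strip().split(",")
    return (customer_id, date), float(amount)


def aggregate(lines):
    """Reduce step: count events and sum amounts per (customer, date).

    Returns sorted rows shaped for a bulk load into a warehouse fact table.
    """
    totals = defaultdict(lambda: [0, 0.0])
    for line in lines:
        key, amount = parse_event(line)
        totals[key][0] += 1
        totals[key][1] += amount
    return [
        (customer_id, date, count, total)
        for (customer_id, date), (count, total) in sorted(totals.items())
    ]


if __name__ == "__main__":
    raw = [
        "c1,2012-03-01,10.0",
        "c1,2012-03-01,5.5",
        "c2,2012-03-01,2.0",
    ]
    for row in aggregate(raw):
        print(row)
```

The warehouse then receives one summarized row per customer and day instead of every raw event, which is where the reduction in load time described above would come from.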
That’s excellent. Yves, thank you for bringing our readers up to speed on your big data initiatives.
Copyright 2004 — 2019. Powell Media, LLC. All rights reserved.
BeyeNETWORK™ is a trademark of Powell Media, LLC