

SPOTLIGHT: Big Data Integration Q&A with Bob Eve of Composite Software

Originally published August 10, 2011

BeyeNETWORK Spotlights focus on news, events and products in the business intelligence ecosystem that are poised to have a significant impact on the industry as a whole; on the enterprises that rely on business intelligence, analytics, performance management, data warehousing and/or data governance products to understand and act on the vital information that can be gleaned from their data; or on the providers of these mission-critical products.

Presented as a Q&A-style article, these interviews with leading voices in the industry, including software vendors, end users and independent consultants, are conducted by the BeyeNETWORK and present the behind-the-scenes view that you won’t read in press releases.

This BeyeNETWORK spotlight features Ron Powell's interview with Bob Eve, Vice President of Marketing for Composite Software. In this interview, Ron and Bob discuss big data integration and Composite’s Next Generation Data Virtualization Platform, Composite 6.

Let's begin by having you tell us how you define big data integration.

Bob Eve: Sure Ron, that's a great place to start. When we think about big data integration, we like to think of it in terms of the big data part and then the data integration part. Let's first talk about big data. There are two types we run into. One is the kind of big data held in data warehouse appliances such as Netezza and Vertica. The other is data in the newer NoSQL-style stores – Hadoop, for example. In both cases, the business value of big data comes from the ability to do analytics on that data. In particular, we're seeing a lot of what we call predictive analytic use cases. Predicting customer churn or the likely results of an upcoming marketing campaign are just a couple of examples.

Typically these analytics are very tightly bound to the data store that has been purposely selected for that type of analysis. For example, clickstream analysis in a Hadoop data store would be a good match. In these cases, the data integration challenge really isn't integrating the analytics with that specific data store; rather, the challenge is to take that information and integrate it with the rest of the enterprise. For example, if we have the clickstream data in Hadoop, we might also want sales activity data from Salesforce.com, shipment transaction data from SAP, customer master data from Teradata, and email campaign data from Unica. If you integrate all of those different pieces of customer-related data, you can really gauge the effectiveness of a marketing program. Big data integration is bringing these big data silos and their new analytics together with the rest of the enterprise.

Well, that sounds great. I understand that Composite 6 provides big data integration support for Cloudera Distribution including Apache Hadoop. Where are you seeing specific customer demand for Apache Hadoop-based solutions?

Bob Eve: We're seeing Hadoop demand accelerate as some of the pioneering analytical work done in the Web 2.0 companies moves into the Fortune 2000 type enterprises. For some great examples of that, I recommend that readers check out the customer section of the Cloudera website, where they have a number of broad use cases – more than even Composite has seen. As these analytic cases are deployed into these companies – whether it's clickstream analysis of the website, tracking of location data, tracking of barcode data in logistics environments, etc. – the next logical step is integrating that data with the rest of the enterprise data. That's what our customers are looking for us to help them accomplish.

Well Bob, it's becoming apparent that the incorporation of Hadoop data can provide greater business insights to an enterprise. What kind of insights does it provide your customers and how does Composite 6 help?

Bob Eve: Let me give you an example from one of our customers, one of the large mutual fund companies based in Boston. They do Wall Street Journal ads that cost tens of thousands of dollars to place. They want to know if these ads are effective. How do they know? One way to measure the effectiveness of the campaign or the ad is to see whether web traffic for the advertised fund goes up after the ad hits. They could do some clickstream analysis of web data, record that in Hadoop and get an answer. But it wouldn’t tell them if the inflows into the fund increased as well because they track fund balances in a different system. To measure the campaign’s effectiveness, they need to know both.

With Composite 6, they can create a high performance federated view across the unstructured clickstream data in Hadoop and the structured data in the funds transaction management system. They can do all of that in SQL as opposed to having to do it with MapReduce in the Hadoop environment, which has a limited set of users and an obscure set of APIs. The federated approach is much easier because everybody in the IT world understands SQL and all the analysis tools work with SQL. So we've really simplified that. If you step back from this particular example of the mutual fund into the more general case, what we did with Composite 6 is provide an enterprise-ready, SQL-centric high performance interface to Hadoop that eliminates a lot of the extra complexity that occurs with MapReduce and Hadoop data stores. Composite 6 makes that information available to a wide range of analytic tools such as SAS.
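To make the federated approach concrete, here is a minimal sketch of the pattern Bob describes: because the virtualization layer exposes a standard SQL interface, an analyst can join a Hadoop-backed clickstream view with a view over the fund transaction system in a single query. The ODBC data source name, view names and columns below are hypothetical illustrations, not Composite's actual schema or API.

```python
# Minimal sketch of a federated query against a data virtualization server.
# The DSN and the view names (clickstream_by_fund, fund_transactions) are
# hypothetical; a real deployment would expose its own published views.
import pyodbc

FEDERATED_QUERY = """
    SELECT c.fund_id,
           SUM(c.page_views)     AS page_views,
           SUM(t.net_inflow_usd) AS net_inflow_usd
    FROM   clickstream_by_fund c          -- view backed by Hadoop
    JOIN   fund_transactions   t          -- view backed by the transaction system
           ON c.fund_id = t.fund_id
          AND c.activity_date = t.trade_date
    WHERE  c.activity_date >= ?           -- e.g., the day the ad ran
    GROUP  BY c.fund_id
"""

def campaign_effect(ad_start_date):
    """Return web traffic and fund inflows per fund since the ad ran."""
    # The ODBC data source name is an assumption; use whatever endpoint the
    # virtualization server publishes in your environment.
    with pyodbc.connect("DSN=data_virtualization_server") as conn:
        cursor = conn.cursor()
        cursor.execute(FEDERATED_QUERY, ad_start_date)
        return cursor.fetchall()
```

The point of the sketch is only that the analyst works entirely in SQL; the virtualization layer is responsible for translating the Hadoop-backed portion of the query behind the scenes.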

I see performance as a key consideration because in the past, without Composite's capability, it would take a long time for a mutual fund company to make the type of determination you described. It sounds like with Composite 6 they're able to do this much more quickly. Is that right?

Bob Eve: Well, there are two aspects to the time horizon. One is the time to build new solutions. We provide a drag-and-drop development approach: you reuse existing queries against some of the other data sources, bring new queries into that environment, join them, do some analysis, learn more, change the analysis, and move quickly. It's a very agile time-to-solution approach.

The other aspect is that with big data you need higher performance and more powerful queries. That's a good thing for us because it's an area where we're very strong; we've focused on it for the last ten years. Greater performance also lets you handle bigger data sets, so you get agility in time-to-solution along with the high-performance queries needed for the larger “big data” data sets.

I know Composite 6 was a very big release. What else is new in this release?

Bob Eve: I mentioned performance a little bit. We're always pushing the envelope on performance, and we've built on last year's work with Netezza and the Composite Information Server for Netezza, which they resell. We've added capabilities to utilize Netezza as a cache data store. Sometimes users materialize a view and then want to store it for reuse or to speed up later queries. With this release, we have provided that capability as well as the ability to leverage all of Netezza's new analytic functions.
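To illustrate the caching idea in the simplest possible terms (this is not Composite's mechanism or syntax, and every connection, view and table name below is hypothetical), the sketch materializes a federated view's result into a cache table on an appliance so that later queries read the pre-computed data instead of re-federating it.

```python
# Sketch of staging ("caching") a federated view's result in an appliance so
# later queries hit the pre-computed table. DSNs, view and table names are
# hypothetical; a data virtualization server manages this automatically, the
# code only shows the idea.
import pyodbc

def refresh_view_cache():
    virt = pyodbc.connect("DSN=data_virtualization_server")  # source: federated view
    appliance = pyodbc.connect("DSN=netezza_cache")          # target: cache store

    rows = virt.cursor().execute(
        "SELECT fund_id, activity_date, page_views FROM clickstream_by_fund"
    ).fetchall()

    cur = appliance.cursor()
    cur.execute("TRUNCATE TABLE cache_clickstream_by_fund")  # full refresh
    cur.executemany(
        "INSERT INTO cache_clickstream_by_fund VALUES (?, ?, ?)",
        rows,
    )
    appliance.commit()
```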

We've added a number of analytic functions ourselves so that our optimizer can automatically decide which capability to use for the highest performance – our analytic functions or the source system's. We've extended a join technique that we pioneered with Netezza last year so that it now works with Sybase and Oracle, taking advantage of their bulk extract and bulk load capabilities for certain types of joins. We've also added a number of enterprise features because, with broadening adoption, we need to provide more role-based use of the development and user interfaces. And we've done a lot to simplify and accelerate data modeling and improve visibility for data governance.
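As a conceptual sketch of the pushdown decision described above, and not Composite's actual cost model, the function below chooses where an analytic function might execute based on whether the source supports it and roughly how many rows would otherwise have to move across the network.

```python
# Conceptual sketch only: one way an optimizer might decide whether to push an
# analytic function down to the source system or run it in the virtualization
# layer. The threshold and the strategy names are illustrative assumptions.
def choose_execution_site(source_supports_function: bool,
                          estimated_rows: int,
                          ship_row_threshold: int = 1_000_000) -> str:
    if source_supports_function:
        # Pushing down avoids moving raw rows across the network.
        return "push analytic function down to the source"
    if estimated_rows <= ship_row_threshold:
        # Small enough result set: pull the rows and compute centrally.
        return "pull rows and compute in the virtualization layer"
    # Large result from a source without the function: stage the data via the
    # source's bulk extract/load path before joining and computing.
    return "bulk extract/load, then compute in the virtualization layer"
```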

Well Bob, you bring up an interesting point, as the number of data sources obviously continues to expand. I keep looking for them to consolidate, but this really plays into your product's capabilities: it seems to me we're constantly creating new data sources, and the ability to join them and get information across this heterogeneous environment is required everywhere. Is that a safe statement?

Bob Eve: Yes, I think the concept of a single, unified enterprise data warehouse is going to be tougher and tougher to achieve with these proliferating data silos. These Hadoop and NoSQL-type stores are very much fit for purpose for those types of analytics. We were just at Netezza's Enzee show and learned that Bank of America has more than 40 Netezza appliances. Those are going to be physically distributed, with each appliance solving a different problem. They're not going to buy one big uber-appliance to sit over the top of them and integrate all of the information. So how do they deal with all these silos? I think people are starting to recognize that it's great to have data warehouses, but it's also necessary to have these silos and analytic data stores. Enterprises proceed down that path because it delivers business solutions quickly. Then when they want to integrate across those data stores, data virtualization is a great way to do it.

With all of the technical advances, are we starting to see a blurring of the line between physical and virtual data implementations?

Bob Eve: Great question, Ron. It's really insightful on your part. I think you're spot on. Ten years ago, information architects had one practical approach for combining all these disparate data sets, and it was data consolidation into a data warehouse or mart of some nature. Over the past ten years, data virtualization companies like Composite have emerged to provide this other option, which is the virtual pulling together of pretty big data sets on the fly. This provided the information architects with a second solution design option. Now I think there's a third in the middle that sort of does blur that line between physical and virtual. In data virtualization, you can cache a data set. The best way to think about it is as a pre-materialized view. You do that in order to stage frequently used data and therefore accelerate queries.

Well, there are so many options that Composite 6 provides around these caches today. Where do you want to store the cache? We can store caches in seven different systems, such as Netezza, in a file system, or in an in-memory data grid like Oracle Coherence. When do you want to update the cache? How often? We have dozens of rules for updating caches. How do you refresh the cache? Do you refresh the entire cache, or do you incrementally refresh using change data capture technology? How do you want to distribute and synchronize that cache information around the world? Similar to a replication environment, you can distribute caches globally and get near real-time response around the world. So think of all those caching options. Is it physical, is it virtual, or is it somewhere in between? As a result, information architects have this third option, and they can mix and match to create these solutions. There is a lot of flexibility and a blurring of the lines between purely physical on one extreme – the data warehouses – and purely virtual on the other. I think it's a great thing for the information architects and the businesses that rely on them.
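As a rough illustration of the incremental-refresh option Bob mentions, the sketch below pulls only rows changed since the last refresh and merges them into the cache, using a stored high-water mark as a simplified stand-in for change data capture. The connections, tables and columns are hypothetical, and a real CDC-based refresh would read the source's change log rather than a timestamp column.

```python
# Simplified sketch of incremental cache refresh: pull only rows modified
# since the last refresh (tracked by a high-water mark) instead of rebuilding
# the whole cache. DSNs, table and column names are hypothetical.
import pyodbc

def incremental_refresh(last_refresh_time):
    source = pyodbc.connect("DSN=data_virtualization_server")
    cache = pyodbc.connect("DSN=cache_store")

    changed = source.cursor().execute(
        "SELECT fund_id, trade_date, net_inflow_usd, last_modified "
        "FROM fund_transactions WHERE last_modified > ?",
        last_refresh_time,
    ).fetchall()

    cur = cache.cursor()
    for fund_id, trade_date, inflow, modified in changed:
        # Upsert each changed row; a real implementation would batch this.
        cur.execute(
            "DELETE FROM cache_fund_transactions "
            "WHERE fund_id = ? AND trade_date = ?",
            fund_id, trade_date,
        )
        cur.execute(
            "INSERT INTO cache_fund_transactions VALUES (?, ?, ?, ?)",
            fund_id, trade_date, inflow, modified,
        )
    cache.commit()
    # Advance the high-water mark for the next refresh cycle.
    return max((row[3] for row in changed), default=last_refresh_time)
```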

Well Bob, I couldn’t agree with you more. It's just amazing how every year there always seem to be new advances and greater technology options introduced. There's no point where we can just stay the course anymore. I really appreciate you taking the time to talk with me today.

Bob Eve: It's been great, Ron, and I think the BeyeNETWORK Spotlights on these technologies are providing a good service to your readers.

  • Ron Powell
    Ron is an independent analyst, consultant and editorial expert with extensive knowledge and experience in business intelligence, big data, analytics and data warehousing. Currently president of Powell Interactive Media, which specializes in consulting and podcast services, he is also Executive Producer of The World Transformed Fast Forward series. In 2004, Ron founded the BeyeNETWORK, which was acquired by TechTarget in 2010. Prior to the founding of the BeyeNETWORK, Ron was cofounder, publisher and editorial director of DM Review (now Information Management). He maintains an expert channel and blog on the BeyeNETWORK and may be contacted by email at rpowell@powellinteractivemedia.com.

    More articles and Ron's blog can be found in his BeyeNETWORK expert channel.
