Accelerating “Big Data” Analytics: A Spotlight Q&A with Roger Gaskell of Kognitio

Originally published June 25, 2012

BeyeNETWORK Spotlights focus on news, events and products in the business intelligence ecosystem that are poised to have a significant impact on the industry as a whole; on the enterprises that rely on business intelligence, analytics, performance management, data warehousing and/or data governance products to understand and act on the vital information that can be gleaned from their data; or on the providers of these mission-critical products.

Presented as Q&A-style articles, these interviews conducted by the BeyeNETWORK present the behind-the-scenes view that you won’t read in press releases.

This BeyeNETWORK spotlight features Ron Powell's interview with Roger Gaskell, CTO of Kognitio. Ron and Roger discuss why big data and Hadoop are driving more companies to adopt analytic accelerators.

Roger, how have the analytical needs of companies, even those who have implemented enterprise data warehouses, evolved in the last 24 months or so?

Roger Gaskell: Well, a number of things have changed dramatically in the last couple of years, and I think the most obvious to everyone is the concept of big data. Many people have hijacked the phrase to mean many different things, but at the end of the day it represents the fact that organizations are trying to gain a competitive advantage by using as much data as they possibly can. They're using all the data they have plus data that they can get from other sources just to give them an edge. They're no longer willing to use samples of their data. For many years, we’ve been talking about increasing data volumes, but now it is really exploding. Obviously, that's given a lot of people in organizations a big headache as they try to deal with those data volumes.

The second thing that has changed is the complexity of the analytical operations that organizations are trying to do. Simple reporting is no longer good enough. They're trying to apply complex algorithms to their data to truly understand what's going on in it. That has created a requirement for raw CPU power. They have to crunch through a lot of data to do these analytics as opposed to just selecting some rows from a dataset. They need to select a lot of rows, and then they need to do some complex crunching on them. The demand for raw CPU power has gone up because these operations tend to be CPU intensive.

The other key issue is what we call latency. It's no longer acceptable to do analytics on data that is out of date, and “out of date” is becoming a shorter and shorter time period. Whereas two, three, maybe four years ago it was okay for a financial organization to calculate its risk on a quarterly basis, more recently they're moving to monthly, weekly, and even daily calculation of that risk. We're even talking to organizations that want to be able to calculate their risk on an almost continuous basis.

The old way of doing analytics was to pull your data out of your data warehouse and build big pre-aggregated OLAP cubes that were effectively a snapshot in time. A typical cube from a large dataset may take many hours to create, so it isn’t done very often. Then you let people analyze the data in that snapshot, which in some cases may be a week or a month old. That model just doesn't work anymore. People need to do their analytics on data that is up to date. The phrase “up to date” has different meanings for each organization, but I think everybody is moving toward near real-time, or business real-time, or whatever phrase is best to describe the use of data that is not historic.

Roger, you have painted quite a picture for big data. You’ve talked about complexity, the need for raw CPU power, the need for decreasing latency of the data, and the ability to look at risk in an incredibly short window. Where do you see these analytical needs moving forward in the next 24 months?

Roger Gaskell: Well, I think the latency issue will just get shorter and shorter until everybody is trying to do things in near real time. As I said previously, I think the meaning of near real time varies from business to business. To some people it’s milliseconds, to others it’s seconds, and to others it’s hours and days, but they're all going to want to do it more frequently.

The other thing that's happening is this big data explosion. This requirement to do everything with every scrap of data you can get your hands on is making more and more people look at alternative technologies for storing that data. They have determined that if they wanted to keep and analyze all of their data in a data warehouse, their licensing costs would go through the roof.

I don't think anyone is considering throwing out their data warehouses, but they are looking at how they can use technologies such as Hadoop alongside their data warehouse to take some of the load and allow them to use different types of data that aren’t ideally suited to data warehouses.

The organizations I've been talking to are looking at how they can use Hadoop or similar types of open source infrastructures alongside their data warehouses. But what’s interesting is that this creates data in different places. Then when they want to do analytics against that data, they need someplace where they can pull that data together. That's going to create a big demand for an analytical layer that can sit above a Hadoop cluster, or a data warehouse, or wherever else the data is persisted. The analytical layer on top is where the datasets are moved so people can perform analytics against them.

Roger, are you seeing a lot of interest from your customers to integrate Kognitio with Hadoop?

Roger Gaskell: Yes. Hadoop will be huge for us, but Hadoop is still in its early days. A lot of the people who are looking at Hadoop are literally “looking” at it. They've heard the name, and they think they should be doing something with it, but they're not quite sure what.

However, those that have implemented Hadoop and gained a lot of advantage from it have realized that one of the things you can't do with Hadoop is ad hoc interactive analytics. There are tools, but they're relatively primitive.

Also, Hadoop doesn’t have the performance to allow someone to do interactive analytics where they ask one question based on the answer to the previous question. That type of analytics requires a response time of a few seconds. You cannot do that if the response times are tens of minutes, and Hadoop is more of a batch-processing environment where you set up a series of jobs and let them process. That just doesn’t fit with the concept of ad hoc interactive analytics.

Other customers have the idea that they can store all their data and process it in Hadoop, but then they're looking for something to allow them to do analytics. In-memory is a very natural fit for that. You keep it on disk within Hadoop, and you move it into memory to do the analytics. You can bring some more data in, and you can change it as often as you wish. This is an area that's going to be very important for Kognitio going forward.
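To make the pattern Roger describes a little more concrete, here is a minimal sketch in Python, using pandas purely as a stand-in for an in-memory analytical layer. This is not Kognitio's actual API, and the table and column names are hypothetical: the data persists on disk in Hadoop, a working set is pulled into memory, and interactive questions are asked against the in-memory copy.

```python
# Illustrative sketch only (not Kognitio code): data stays on disk in Hadoop,
# a working set is moved into memory, and analytics run against the in-memory copy.
import numpy as np
import pandas as pd

# Stand-in for an extract pulled from HDFS (in practice this might be read with
# something like pd.read_parquet("hdfs://..."), via a connector, or an export job).
rng = np.random.default_rng(0)
transactions = pd.DataFrame({
    "store_id": rng.integers(1, 50, size=1_000_000),
    "product_id": rng.integers(1, 500, size=1_000_000),
    "amount": rng.uniform(1, 100, size=1_000_000).round(2),
})

# Once the working set is in memory, ad hoc questions come back in seconds,
# so the analyst can refine the next query based on the previous answer.
top_stores = transactions.groupby("store_id")["amount"].sum().nlargest(10)
print(top_stores)
```

The working set can be refreshed or swapped out as often as needed, which is the point Roger makes about bringing more data in and changing it as you wish.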

Obviously, performance is really the key when we’re talking about big data. The concept of an in-memory analytical platform has always been appealing because it greatly reduces latency, but typically it was expensive to do. What do you think has changed in the market to make an in-memory analytical platform a better choice?

Roger Gaskell: The key, I believe, over the last two years is that the volume of memory you can get in a very cheap, industry-standard, off-the-shelf server has increased dramatically. If I think back to two years ago when we built our appliances, we were buying servers with 32 GB of RAM, and now for the same price we're buying servers with a quarter of a terabyte of RAM and 32 CPU cores. The hardware that we're using to build these platforms now is unbelievably cheap. It’s the same reason Hadoop is taking off. Using low-cost, industry-standard hardware for the analytical in-memory platforms and bringing lots of it together to make very large systems provides a very cost-effective platform.

But it's not just the ability to put data into memory that gives you the performance from these in-memory analytical platforms. Putting data in memory is just another place that you can park your data. What makes it fast is that once the data is in memory, you're no longer limited by disk I/O speeds. You can read the data back fast enough to drive as many CPU cores as you have at full speed, assuming you can fully parallelize your query. As a result, you're splitting the workload across lots of CPUs, and you're keeping them 100% busy for the duration of your query. I think that's the interesting fact that people are going to realize over the coming year about in-memory platforms. It's not being in-memory that makes them fast. It’s the amount of CPU power you can bring to bear against the data, assuming you have an environment that's fully parallelized.
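As an illustration of that point – a hedged Python sketch, not Kognitio code – the speedup comes from splitting an in-memory dataset evenly across all available cores and keeping each one busy for the duration of the query:

```python
# Illustrative sketch: with the data in memory and the work split evenly,
# every CPU core stays busy until the query finishes.
import numpy as np
from multiprocessing import Pool, cpu_count

def partial_sum(chunk: np.ndarray) -> float:
    # Each worker crunches its own slice of the in-memory data at full speed.
    return float(chunk.sum())

if __name__ == "__main__":
    data = np.random.rand(10_000_000)           # the in-memory dataset
    chunks = np.array_split(data, cpu_count())  # one slice per CPU core
    with Pool(cpu_count()) as pool:
        total = sum(pool.map(partial_sum, chunks))
    print(f"total across {cpu_count()} cores: {total:.2f}")
```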

What types of companies do you feel will benefit from this type of analytical acceleration?

Roger Gaskell: Basically any company that's trying to work with big data and low latency would benefit, but the markets where we are seeing interest are around risk and retail analytics. Those seem to be the key ones. Another important area for us is companies that are making money from selling data. These companies acquire data from various sources, they combine it, and they sell it. Their revenue comes from the fact that the higher the throughput they can get, the more money they can make out of a given system. Performance is key for them because it equals throughput.

Another area where we are seeing a lot of interest is the idea of an analytic accelerator for Hadoop. Companies are realizing it's not just a question of performance. With Hadoop, they have issues with connecting the front-end tools that people have on their desktops all around the business. These tools don’t easily connect directly to the Hadoop systems they're building. Putting an in-memory analytical layer in the middle gives them the performance benefits. It also gives them the ability to connect all those tools to the data in Hadoop. The in-memory analytical layers fully support standard SQL and MDX connectivity, virtualized cubes and everything you need to support those tools. Performance is a big thing, but connectivity to industry-standard tools is another thing that's driving people to look at putting in an in-memory layer.
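A rough sketch of the connectivity idea, using SQLite's in-memory engine purely as a stand-in for an in-memory analytical layer (the table and rows are invented for illustration): any desktop tool that speaks standard SQL can run the same kind of query without connecting to Hadoop directly.

```python
# Sketch only: SQLite in memory stands in for an in-memory analytical layer.
# In the pattern described above, the rows would be loaded from Hadoop,
# a data warehouse, or both, and BI tools would connect over standard SQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("EMEA", "widgets", 120.0), ("EMEA", "gadgets", 80.0), ("APAC", "widgets", 200.0)],
)
for row in conn.execute(
    "SELECT region, SUM(revenue) FROM sales GROUP BY region ORDER BY 2 DESC"
):
    print(row)
```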

You mentioned customer behavior. I would assume you're seeing a lot of your customers trying to integrate the social platforms like Facebook and Twitter to gain additional insight.

Roger Gaskell: One of the key drivers in the big data space is trying to integrate sentiment data as well as day-to-day transactions that go through the business. As I said earlier, companies today are trying to incorporate as much data as they can to make better decisions. Social media is one of the external data sources that people are trying to integrate, and a lot of people are being driven down the Hadoop route for that because the data volumes are so large and the data is unstructured. Hadoop is a very natural fit for that.

Roger, are there any specific platforms that work best when using an analytic accelerator?

Roger Gaskell: The space that seems to be really interested in analytic accelerators – because it has a number of problems that need to be solved – is Hadoop. As I said earlier, they don’t have the performance to do interactive ad hoc analytics in Hadoop. Secondly, they don’t really have the ability to connect the tools that they use for analytics to Hadoop, and they don’t really want to go around to all the users and tell them to use something different to do their work. As a result, analytic accelerators are gaining a lot of acceptance in that space. The answer to that question is definitely Hadoop, but it’s Hadoop alongside other data warehouses. It's not a matter of having to move everything into Hadoop to use an in-memory analytical layer. You can drive your data from Hadoop, you can drive it directly from your data warehouse or, more likely, you can drive it from both.

Roger, Kognitio has been doing in-memory going way back to the mid-to-late ‘80s. It seems the market has finally caught up to Kognitio.

Roger Gaskell: When I joined Kognitio back in 1988 to develop an in-memory analytical database from scratch, I thought we were being reasonably innovative. At that time, I didn’t understand that it was 23 years ahead of its time. It’s nice to say that we had a huge amount of vision, but I think we expected the market to get it much quicker than it did. I'm grateful it has finally happened. One of the key reasons is that the price of the hardware has come down dramatically. Secondly, we have to thank SAP because they have spent millions of dollars telling everybody, quite correctly, that in-memory is the way to solve this analytical performance problem.

We have talked a lot about hardware, but really the software is the key. Your platform doesn’t require the typical database work necessary with many of the other options available today, correct?

Roger Gaskell: Correct. There's no magic. We've replaced complexity with horsepower by putting the data in-memory, which allows us to use loads of low-cost CPUs against the data. What we think about in terms of the performance of these platforms is how much data each CPU core has to look after and how much work each CPU core is doing when you fully parallelize the query. It's pretty important to us that you can fully parallelize everything so that you can get those levels of performance. But when you've done that and you've brought that huge amount of CPU power to bear against the data, you then don’t have to do things like indexing, complex partitioning, hints, and other tricks, or change data structures and columnize them to get the performance. You just don’t have to do that because you have the cheap CPU power. You can actually crunch through the data so quickly that you can look at everything every time you run a query. It just makes things a lot simpler from an administration point of view.

Roger, could you give us some examples of the industries where Kognitio is most heavily implemented? I think that would be great for our audience.

Roger Gaskell: Okay. We’re seeing success with analytical service providers – those companies that bring data from different sources and then sell it. We have a number of examples of that. We are also seeing increasing amounts of success in the risk analytics space. In that space, people want to be able to calculate risk on a more frequent basis, almost to the point where they know their current risk at any point. That requires huge amounts of power because the calculations are quite complex. But the biggest vertical for us currently is the retail analytics space around customer behavior.

One of our customers is a company called Aimia. They basically have a retail analytics application, which is powered by the Kognitio in-memory analytical platform. They bring in all of the point-of-sale data from the retailers – every single transaction over long periods of time, two, three, or five years. They put it in-memory, and they combine it with loyalty card data and demographic data so that they can do some very complex analysis about why people are buying and what they're buying. That particular type of analysis is a difficult thing to do because you end up joining the big fact table back on itself. You get tables of hundreds of billions of rows being joined back on themselves. It’s a difficult thing to do. But because of the performance of in-memory and the power it brings, they can actually do that with 100% of the data, and then they can sell access to that data back to the retail suppliers. For example, if you're Pepsi or Coca-Cola and you're selling to a big retailer, you can pay to go onto these systems, and you can see when people are buying your products and what they're buying with them at that particular retailer and do fairly complex analysis. The important thing for Aimia is that the more people they can have working in that system and the more reports they can generate, the more money they can make. By putting the data in-memory, they can do very complex market-basket analysis reports with response times of under two minutes. They can get thousands of users registered on any one particular system, with each user paying a subscription fee to be there. The power of in-memory combined with huge processing power allows them to generate tens of thousands of reports every day, and every report is worth money to them.
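For readers curious about the shape of the query Roger describes, here is a minimal, hypothetical sketch of the self-join behind market-basket analysis: joining the transaction table back on itself to count products bought together. SQLite in memory is used only to show the query shape, not the scale, and the table and product names are made up.

```python
# Sketch of the market-basket self-join: the transaction table is joined
# back on itself to count which products appear in the same basket.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE basket_items (basket_id INTEGER, product TEXT)")
conn.executemany(
    "INSERT INTO basket_items VALUES (?, ?)",
    [(1, "cola"), (1, "chips"), (2, "cola"), (2, "chips"), (2, "salsa"), (3, "cola")],
)
pairs = conn.execute("""
    SELECT a.product, b.product, COUNT(*) AS together
    FROM basket_items a
    JOIN basket_items b
      ON a.basket_id = b.basket_id AND a.product < b.product
    GROUP BY a.product, b.product
    ORDER BY together DESC
""").fetchall()
print(pairs)  # e.g. [('chips', 'cola', 2), ('chips', 'salsa', 1), ('cola', 'salsa', 1)]
```

At the scale Roger mentions – fact tables of hundreds of billions of rows joined back on themselves – this same query shape is what demands the in-memory, fully parallelized processing discussed earlier.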

Roger, based on the experience of your customers, could you give our audience an idea of how large these in-memory systems can be?
 
Roger Gaskell: The average system for us is probably about eight to ten terabytes of data physically in-memory. We have customers that have a lot more than that, but they're all multi-terabyte. This is not about small gigabyte datasets. We have built platforms with tens of terabytes in-memory. Tens of terabytes is becoming more and more common now, and I anticipate in a year or two we’ll be talking hundreds of terabytes.

That is amazing. Roger, thank you for providing our readers with such great information and insight into Kognitio and analytic accelerators.

  • Ron Powell
    Ron is an independent analyst, consultant and editorial expert with extensive knowledge and experience in business intelligence, big data, analytics and data warehousing. Currently president of Powell Interactive Media, which specializes in consulting and podcast services, he is also Executive Producer of The World Transformed Fast Forward series. In 2004, Ron founded the BeyeNETWORK, which was acquired by Tech Target in 2010.  Prior to the founding of the BeyeNETWORK, Ron was cofounder, publisher and editorial director of DM Review (now Information Management). He maintains an expert channel and blog on the BeyeNETWORK and may be contacted by email at rpowell@powellinteractivemedia.com. 

    More articles and Ron's blog can be found in his BeyeNETWORK expert channel.
