

In-Memory Data Grids for Analytics on Fast-Changing Data: A Q&A with William Bain of ScaleOut Software

Originally published December 10, 2012

This BeyeNETWORK spotlight features Ron Powell's interview with William Bain, CEO of ScaleOut Software. Ron and William discuss the many capabilities of in-memory data grids and how they are being used to perform analytics on rapidly changing operational data. 

Bill, for our readers who are not familiar with ScaleOut Software, can you give us an overview of the company?

William Bain: I founded ScaleOut Software in 2003. I previously worked at Microsoft, and prior to that I was at Intel’s Supercomputer Systems Division and Bell Laboratories. We introduced our product to the market in January 2005, and we have products available for both Windows and Linux systems. Our mission is to help customers scale the performance of their applications that handle fast-changing data and help them perform analytics on that data.

I have had a career focus in parallel computing, and so what I am trying to do with this company is to bring parallel computing techniques to the problems of scaling application performance and analytics. That is really what ScaleOut Software is all about.

Bill, ScaleOut Software utilizes an in-memory data grid. Can you explain to our audience what that is?

William Bain: An in-memory data grid (IMDG) is a software technology – middleware software actually – that helps applications scale performance and perform analytics on fast-changing data. It is designed to run across a set of servers. An application accesses an in-memory data grid just the way it would access a file system or a database server. The IMDG’s goal is to offload database servers and file systems by holding data in-memory, and stored data is replicated to other servers to maintain high availability in case of a failure. It allows the application to store fast-changing data with scalable capacity and throughput, and it makes the data shareable by all servers that are accessing it. An IMDG solves the problem of shipping data between servers and also solves the problem of handling a large volume of fast-changing data without having to push that data into a database server or a file system to hold it. 

An in-memory data grid has two other important capabilities. It can allow data that is changing rapidly to be accessed from multiple geographic sites or replicate data to remote sites for disaster recovery. It also incorporates an integrated analytics engine so all of the data that’s churning inside of the grid can be analyzed in real time while it is being updated by the application.
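The storage model described above (objects spread across servers, each with a replica so that a server failure loses no data) can be illustrated with a minimal sketch. This is a toy under stated assumptions, not ScaleOut's implementation or API: the `TinyGrid` class, its method names, the hash-based placement, and the single-replica policy are all hypothetical.

```python
import hashlib


class TinyGrid:
    """Toy in-memory grid: every object is stored on a primary and one replica."""

    def __init__(self, server_names):
        # One dict per server stands in for that server's memory.
        self.servers = {name: {} for name in server_names}

    def _owners(self, key):
        # Hash the key to pick a primary; the next server in sorted order
        # holds the replica (assumed placement policy).
        names = sorted(self.servers)
        start = int(hashlib.sha256(key.encode()).hexdigest(), 16) % len(names)
        return names[start], names[(start + 1) % len(names)]

    def put(self, key, value):
        for name in self._owners(key):
            self.servers[name][key] = value

    def get(self, key):
        for name in self._owners(key):
            if key in self.servers[name]:
                return self.servers[name][key]
        raise KeyError(key)

    def fail(self, name):
        # Simulate losing a server, then redistribute the surviving copies
        # so that every object has two copies again.
        del self.servers[name]
        survivors = {}
        for store in self.servers.values():
            survivors.update(store)
            store.clear()
        for key, value in survivors.items():
            self.put(key, value)


grid = TinyGrid(["server-1", "server-2", "server-3"])
grid.put("cart:42", {"items": ["book"]})
grid.fail("server-2")
recovered = grid.get("cart:42")
copies = sum("cart:42" in store for store in grid.servers.values())
```

After the simulated failure the object is still readable and is again held on two servers, which is the behavior the replication scheme is meant to guarantee.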

Our audience is very focused on analytics. What makes the in-memory data grid a good foundation for analytics computing and what are the benefits?

William Bain: Well, it's an unusually good platform for analytics because it can work on fast-changing data. It allows users to perform “map/reduce”-style computations on fast-changing data in real time. It analyzes what is called operational data – data that an application is changing over the course of the workday in its business logic code. An in-memory data grid integrates very nicely into business logic because by using an object-oriented programming model, it easily allows the application to update the data and to analyze it using object-oriented techniques. It gives users a simple, powerful analytics platform that's integrated with the operational use of data.

The product is a pure in-memory product. Obviously, when we talk about analytics and big data, aren't most data sets too big to fit in-memory?

William Bain: Well, actually no, they’re not too big. Something like 60% of all data sets will fit in ten terabytes of memory or less according to Ovum, and the average data set size is about three terabytes. We have more than 350 customers worldwide, and what we find is that their operational data fits nicely into memory. We've done some tests that show that an IMDG can hold a terabyte of data, for example, in 64 servers while maintaining full high-availability through data replication. You can easily hold up to ten terabytes with in-memory techniques. We expect that over the next few years, as people incorporate solid-state devices into in-memory data grids, the range for this technology will expand greatly and be able to handle even larger data sets.
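As a rough check on these figures, the memory needed per server can be worked out directly. The sketch below assumes data is spread evenly and that each object is kept as one primary copy plus one replica; the interview does not state the replica count, so `copies=2` is an assumption, as is the helper name `memory_per_server_gb`.

```python
def memory_per_server_gb(dataset_tb, servers, copies=2):
    """Memory each server must hold, in GB, for an evenly spread data set.

    copies=2 assumes one primary plus one replica per object (not stated
    in the interview).
    """
    return dataset_tb * 1024 * copies / servers


one_tb = memory_per_server_gb(1, 64)    # the 1 TB on 64 servers case
ten_tb = memory_per_server_gb(10, 64)   # the 10 TB upper figure on the same 64 servers
```

Under these assumptions, one terabyte on 64 servers works out to 32 GB per server, and ten terabytes on the same 64 servers to 320 GB per server.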

Bill, I find that absolutely amazing that we can hold ten terabytes in-memory today. That is a far cry from what we could have done even four or five years ago.

William Bain: Absolutely, that's true. I remember at Intel Supercomputers in the late 1980s that we had announced a parallel file system with disks each holding 540 megabytes. The whole system was nine gigabytes, and we thought that was a huge amount of storage. To be able to hold many times that amount of data in-memory today is really amazing. 

There are a lot of analytic platforms on the market today. Can you describe the specific capabilities and usage characteristics of the ScaleOut technology and what sets you apart from the competition?

William Bain: Well, really one of the key differences is that IMDGs are designed to hold operational data, that is, data that's changing rapidly, and so we've designed our analytics platform and our grid overall to avoid many of the overheads that are inherent to other analytics platforms. For example, we avoid batch scheduling; our IMDG is not a multi-tenant platform. It typically runs one application, which may have many subparts to it but is an integrated application that is performing updates to data, reads of data, and also map/reduce-style analytics on that data while it’s changing. More importantly, we avoid data motion so we can take advantage of the grid's dynamic load-balancer, which evenly spreads the data out across the set of servers that form the grid. We can take advantage of load-balancing to avoid moving data within the grid when performing analytics. And what our tests have shown is that the more you can avoid network and file I/O overheads to access data, the faster your analytics will run. As a result, we see map/reduce response times on the order of seconds instead of minutes or hours. So an application can repeatedly and continuously perform map/reduce on this fast-changing data, spot trends very quickly, and be able to react to those trends without delay.

And that ability to do it in real time is, in many ways and for many applications, really critical from an analytics perspective.

William Bain: Absolutely.

Since approximately 70% of the enterprises comprising our audience are in the early adoption phase with Hadoop, can you talk about the ways your approach is different?

William Bain: Sure. Notice I say map/reduce-style. That is because our map/reduce algorithm is a bit simplified from the Hadoop approach. First of all, stored data is object-oriented so that it integrates well with business logic instead of with file systems. For example, we avoid the use of what Hadoop calls “record readers” to parse files and to provide data for analytics. Instead, we use a very simple object-oriented query of data held in the grid. The IMDG performs queries based on object properties, so it's a very natural way to select data for analysis and it requires very little training to learn.

Also, we have reworked the reduction side of the Hadoop algorithm to simplify it. Drawing upon techniques that I learned in parallel computing going back to the mid-'80s, we found that performing a binary merge instead of using a multiple-reducer model like Hadoop's can really simplify the algorithms needed by the application to combine all the data that has been analyzed and report it back to the source. As a result, our IMDG uses a binary merge reduction model that eliminates the complexity of the multiple key/value spaces and multiple reducers that are used in Hadoop programs. It also accelerates performance because our IMDG provides a spanning tree to combine data across multiple servers in a binary merge and report the data back to the source of the map/reduce.

The result of all this is a very simple model that is self-tuning and requires very little training. It delivers high performance without the need for the application developer to tweak the application multiple times to eliminate bottlenecks. We also like this analytics model because we find that our customers can easily migrate to it from their operational use of the in-memory data grid and quickly obtain fast results.
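The two ideas in this answer, selecting objects by their properties and combining per-server results with a binary merge, can be sketched in a few lines. This is an illustration under assumptions, not ScaleOut's API: the `Trade` class, the function names, and the use of plain lists to stand in for each server's local data are all hypothetical, and the merge runs sequentially here rather than across a real spanning tree of servers.

```python
from dataclasses import dataclass


@dataclass
class Trade:
    symbol: str
    quantity: int
    price: float


def query(partition, **properties):
    """Select objects whose attributes equal the given property values."""
    return [obj for obj in partition
            if all(getattr(obj, name) == value
                   for name, value in properties.items())]


def analyze(trades):
    """Per-server analysis step over local data: total traded value."""
    return sum(t.quantity * t.price for t in trades)


def merge(left, right):
    """Binary merge: combine two partial results into one."""
    return left + right


def binary_merge_reduce(partials, merge_fn):
    """Combine partial results pairwise, level by level, like a merge tree."""
    if not partials:
        raise ValueError("need at least one partial result")
    level = list(partials)
    while len(level) > 1:
        merged = [merge_fn(level[i], level[i + 1])
                  for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            merged.append(level[-1])  # odd one out moves up unmerged
        level = merged
    return level[0]


# Each list stands in for the objects held on one grid server.
partitions = [
    [Trade("ACME", 10, 2.5), Trade("INIT", 5, 1.0)],
    [Trade("ACME", 4, 3.0)],
    [Trade("ACME", 2, 5.0), Trade("ACME", 1, 1.5)],
]
partials = [analyze(query(p, symbol="ACME")) for p in partitions]
total = binary_merge_reduce(partials, merge)
```

The application supplies only an analysis function and a two-argument merge function. There are no intermediate key/value spaces or separate reducer stages to configure, which is the simplification being described.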

If an enterprise would like to implement an in-memory data grid with ScaleOut Software, how long would it take? Are we talking days or months to implement the initial application?
 
William Bain: Well, the installation of the product is measured in minutes, about ten minutes per server. It is designed to be as easy as possible. We use several techniques in the product to avoid the need for the user to have to worry about what we call the “plumbing” of an in-memory data grid. So for example, the grid servers are self-aggregating; they find each other and form a grid automatically. They load-balance data and store copies automatically. If a failure occurs, they automatically detect it and self-heal to restore the redundancy of the grid. So from an operational perspective, it's very easy to bring up an in-memory data grid and put it into operation. 
 
From the developer’s perspective, it's also simple to use a grid because we're leveraging well-understood object-oriented techniques from Java and the .NET world with C#. That means programmers who understand Java and C# can quickly store data in the grid and perform map/reduce-style analytics without having to learn new techniques. The net result is that in a few hours any competent Java or C# developer can have an IMDG up and running and perform analytics on it.

Could you provide some specific customer examples or use cases that have benefited significantly from using your in-memory data grid?

William Bain: As I mentioned, we have more than 350 customers. To give you a few examples, we have a very large retailer that uses an IMDG to hold all of its shopping carts. It uses the grid primarily for scalability, not for analytics: to scale performance by holding a very large number of shopping carts as its user base grows and to ensure fast response times. With an IMDG it also has the ability to store these shopping carts in multiple cities so that if there's a disaster like the recent Hurricane Sandy, it is able to quickly shift its operations to another data center without the loss of hot data that's been changing as customers shop.

Another example would be a large, very well-known financial services company with more than 1,000 servers across six product divisions running our product. We're used in a data warehouse for holding financial news and headline news and also for financial advising. So the grid is used for both performance scaling and for column-based analysis of data within the grid. 

We have a large, well-known airline that's holding four days of reservations in the grid, two days forward and one day back plus the current day. They are using an IMDG not only for performance scaling but also to be able to look at the data to find problems and quickly react to them, for example, to find all the passengers affected by a flight delay. The ability to perform map/reduce-style analytics on data that is rapidly changing in an airline reservation system is a real advantage of this technology. 

We have other applications such as hedge funds and credit unions. So for example, one of our customers is a credit union that wants to maximize the return on its loans and minimize risk by analyzing how its loans would be deployed to customers with different credit scores. It wants to be able to use map/reduce-style analytics to run, if you will, a value-at-risk algorithm across the loans it wants to issue. Also, the same kind of algorithm is used by a hedge fund trying to optimize its trading and minimize risk during the course of the trading day. You can see that doing analytics on data that's churning very rapidly during the course of the workday, with large volumes of data coming in from many users, is a real advantage of an IMDG because that data is in-memory and can be analyzed rapidly.

Well, Bill, you mentioned the hurricane and the disastrous effects. Are there other operational areas where an in-memory data grid can be deployed? 

William Bain: Well, actually, yes. What's interesting is that you can use an in-memory data grid for many other applications. For example, you can do analysis of logistics and make sure that all of the planning is being carried out in the way intended. In a smart grid you can use an IMDG to watch and optimize energy flow. So I think we're going to see widespread use of IMDGs throughout many sectors over the coming years.

We've talked a lot about analytics. What else can an in-memory data grid be used for?

William Bain: Some of the applications we're looking at, as I mentioned, include smart grids for energy. We see climate change coming on rapidly and would like to use an IMDG to help conserve energy within a smart grid. Also, we see uses in manufacturing, military, and civilian logistics, for example, in the placement and motion of assets in real time to make sure those operations are optimized. There are other uses in telecom systems and game servers. The list just goes on and on, including other applications in financial services and credit unions for managing loans. In general, we see a large number of applications in which the intent is to handle fast-changing data in a scalable way and be able to analyze that data and react in as close to real time as possible. I think companies that employ this technology will have that ability to analyze and react to changes in seconds instead of minutes, hours, or even days.

Bill, thank you so much for the insight today into in-memory data grids. It is amazing what ScaleOut Software is doing, especially with such significant amounts of data.

  • Ron Powell
    Ron is an independent analyst, consultant and editorial expert with extensive knowledge and experience in business intelligence, big data, analytics and data warehousing. Currently president of Powell Interactive Media, which specializes in consulting and podcast services, he is also Executive Producer of The World Transformed Fast Forward series. In 2004, Ron founded the BeyeNETWORK, which was acquired by TechTarget in 2010. Prior to the founding of the BeyeNETWORK, Ron was cofounder, publisher and editorial director of DM Review (now Information Management). He maintains an expert channel and blog on the BeyeNETWORK and may be contacted by email at rpowell@powellinteractivemedia.com.

    More articles and Ron's blog can be found in his BeyeNETWORK expert channel.
