This BeyeNETWORK spotlight features Ron Powell's interview with Paul Kent, SAS Vice President of Big Data. Ron and Paul discuss how SAS is approaching big data, and Paul provides examples of the insights big data can provide.

Ron Powell: Paul, with SAS being one of the leaders in analytics, how do you define "big data," and how does it differ from what we formerly considered big data?
Paul Kent: Well, my definition is actually a convenient one. I say that it's more volume or more complexity than you're comfortable with. It's that tipping point where your traditional tools and traditional approaches are straining or reaching the edge of their capabilities. I think that fits for a lot of folks. They would like to process more data, but they know their current tool chain can't handle it, or they know their budget won't allow them to keep it in the technology they use today. So my definition of big data is the situation where you've crossed some divide that your current toolset cannot help you handle. What do you think of that?

Ron Powell: That is an excellent definition, especially as we have now moved beyond terabytes to petabytes.

Paul Kent: Right.

Ron Powell: So how is SAS evolving to address the new world of big data, and how does your high-performance analytics server fit into this picture?
Paul Kent: Well, we have to reinvest in our software. We're in a big cycle here. Every ten or so years, Dr. Goodnight challenges us to do something big with our software. The previous time, he challenged us to rewrite it from soup to nuts in the C language, which took us from the mainframe to being credible on PCs and UNIX boxes. It turned out to be a pretty clever investment on his part to say, "No, I'm going to rewrite it all from the ground up to anticipate running on all these new environments." In the current timeframe, he's challenging us to rewrite our applications and our algorithms to run on massively parallel infrastructures.
We have to learn from our cousins in the high-performance computing community, where they're doing weather simulation, nuclear forecasting, and trying to simulate the birth of galaxies – the labs that do that kind of exotic computing. We have to marry that idea with all the advances in massive parallelism in the database world – the massively parallel database machines and, of course, Hadoop, which is an open-source incarnation of the very same idea: you take your data, spread it out across many servers, and send the work down to where the data lives. The way we're addressing big data is to rewrite our software to live in that world and use the techniques you need to address problems of that size or complexity.

Ron Powell: So you're really taking away the concern about how big the data is, correct?
Paul Kent: Yes. In fact, one of our rallying points is: why bother to sample any longer? Simply use all of your data. There's no need to spend a lot of time agonizing over choosing a representative sample and proving that it correctly models the larger population it was drawn from. Why not use all of your data to compute your statistics and build your models? Technology has advanced, and that's no longer unthinkable.

Ron Powell: So as customers work with the new big data sources, does the data have to be moved into SAS to perform the analytics?
Paul Kent: That's the subtle change with big data. You really have to figure out a way to move the work to the data, not the other way around. For the first 40 years of our analytics, it was "bring me the records of the data set, and I will compute the statistics, and then the model, in the SAS server." For most of computing, the pattern was: the data is out there; bring it to me and I'll calculate on it.
But when the big data phenomenon takes hold, the mass of data has far more "center of gravity" than the CPU processing does. Successful approaches are the ones that say, "Let me find a way to package up this algorithm so it works as a team, so that I can send the work out to the units (slices) of the data and do the calculations without moving any of that data a long way."
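The pattern Paul describes – ship a small unit of work to each slice of the data and combine only the tiny partial results – is the essence of map-reduce-style computation. A minimal illustrative sketch in Python (the partition layout and function names are hypothetical, not SAS or Hadoop APIs):

```python
# Each "node" holds its own slice of the records; the full dataset never moves.
partitions = [
    [12.0, 15.5, 9.8],       # slice stored on node 1
    [22.1, 18.4],            # slice stored on node 2
    [10.0, 11.2, 13.7, 9.9], # slice stored on node 3
]

def local_stats(records):
    """The small unit of work shipped to each node: return (sum, count)."""
    return sum(records), len(records)

# "Send the work to the data": run local_stats where each slice lives,
# then combine the tiny partial results centrally.
partials = [local_stats(p) for p in partitions]
total = sum(s for s, _ in partials)
count = sum(n for _, n in partials)
mean = total / count
print(f"mean of all {count} records: {mean:.3f}")
```

Only the (sum, count) pairs travel over the network; the records themselves stay put. Note this also computes the exact statistic over every record, which is why sampling becomes unnecessary at this scale.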
If you can send the units of work to the same divisions the data has already been spread across in a cluster, then you're ahead of the game. This is the challenge. So you don't have to move your data into SAS, but you do have to have it stored in some kind of massively parallel database or Hadoop infrastructure, and the rewriting we're doing on our software is essentially to send the work into those systems. We have a symmetric model, as on a Hadoop cluster, where the SAS workers are hosted on the same data nodes as the Hadoop cluster; and we have an asymmetric model, as with a Teradata or Oracle machine, where the SAS workers can be on separate nodes from the parallel database appliance.

Ron Powell: So obviously today, customers have many choices. Everybody's doing big data today. What is it that makes SAS the best choice for big data?
Paul Kent: There are three legs to the stool in my answer. First, SAS is a leader in analytics: by some measures we have a 31% share of the market, and the next nearest competitor is at about 14%. Second, I think we can help you complete the whole lifecycle of an analytic process. It's not just building models, and it's not just taking those models and operationalizing them; you have to monitor their performance over time, and you have to do a lot of data prep and assembly before you build good models in the first place. We've been doing this a long time, and we have experience in that full circle of life, if you will, of an analytic project. Third, we have considerable domain experience.
Having done this for many years, we can take the mathematics and overlay it with the business problem, so we're not talking about general-purpose optimization. We're talking about optimization as it allocates your customers to the channels of your retail organization or your bank as you try to reach them through text messages, phone calls from a call center, direct mail, and so on.

Ron Powell: So could you provide us with specific industry examples of the types of things that can be accomplished now that we can analyze all the data, or truly big data?
Paul Kent: In the retail industry, folks are able to do their forecasting, and also their measuring and modeling of consumer price sensitivity, at much finer granularity than in the past. Now they're able to do it at the store level, at the stock-keeping unit (SKU) level, or by unit of time. In the past, they might forecast regionally, or forecast a group of products together as a category of merchandise instead of the individual SKUs that make up the category. Now they can take their analysis to that next level of granularity because the enhanced performance of the algorithms can handle that many more crossings of individual data points to forecast. So in retail, big data has allowed folks to become more granular.
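The "crossings" Paul mentions are the distinct cells the data is grouped into before forecasting; moving from category-level to store x SKU x week cells multiplies the number of series to fit. A toy Python sketch with hypothetical data (not SAS code) makes the difference concrete:

```python
from collections import defaultdict

# Toy transaction feed; in practice this is billions of rows.
sales = [
    # (store, sku, category, week, units)
    ("S1", "SKU-1", "soda", 1, 30),
    ("S1", "SKU-2", "soda", 1, 12),
    ("S2", "SKU-1", "soda", 1, 45),
    ("S1", "SKU-1", "soda", 2, 28),
]

# Coarse crossing: one time series per category (the older rollup style).
by_category = defaultdict(int)
# Fine crossing: one time series per store x SKU x week cell.
by_cell = defaultdict(int)

for store, sku, category, week, units in sales:
    by_category[(category, week)] += units
    by_cell[(store, sku, week)] += units

print(len(by_category), "coarse series points")       # fewer, blended cells
print(len(by_cell), "fine-grained series points")     # one per store/SKU/week
```

Even in this four-row toy, the fine crossing doubles the number of cells; at retail scale the explosion is what demands the parallel algorithms discussed above.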
You may have seen the commercials for the Snapshot device from Progressive Insurance that they'll put in your car or maybe, if you're like me, in your teenager's car. The promise of the sensor is that if you are a safe, easy driver, you'll get a better insurance rate from the company. Now, the data exhaust from that Snapshot device is huge for the insurance company. They have to learn how to process it and how to determine what kind of driving turns into a discounted insurance rate or a safe-driver reward. The very existence of that product was predicated on the ability to handle the massive amounts of data that techniques like Hadoop enable.

Ron Powell: What has been learned from the big data giants like Google and Facebook? Will other companies be able to duplicate their successful use of big data?
Paul Kent: Well, I think the first thing people learn when they try to copy Google and Facebook is that Google and Facebook have a lot more engineers. Certainly, one of the things the big data world has to go through is a bit of maturing; it's a bit like the Wild West out there with software tools. The pace of change is very rapid, and the amount of manual "plug this into that and get it all working" is a lot more than enterprise customers are used to dealing with. So they do have to go back to basics and learn a lot of Java to stitch these tool chains together. I think that's the opportunity for companies like SAS. We can bring the skills of delivering software to the enterprise to the techniques pioneered at Google and Facebook, buff some of the edges and corners, and make it all more approachable for enterprises that are not so lucky as to employ as many top-shelf engineers as Google and Facebook can.

Ron Powell: Do you see incorporating a lot of these analytics into applications so that they can be readily used by the business?
Paul Kent: Absolutely. SAS has really transformed itself into a solutions company, so at the outer level we don't just solve the technical problems. We like to go to our customers and solve the big, complex problems they have at a business level. The solution, the technique, and the mathematics that might have to operate on big data are inside the software that solves the business problem. Many people these days are putting more emphasis on catching the bad guys and running down fraud. For example, in the social government space, Los Angeles County issues vouchers for daycare for underprivileged families. Big data techniques allowed them to identify a bad guy who gets vouchers for six or seven kids, where two of them are dropped off in one spot two hours to the north and another three are dropped off two hours to the south. Obviously, that's a little unlikely, so they could investigate this chap to see if he's cheating the system. The application of big data techniques and graph analytics to expose these networks of unlikely connections was a key factor in the success of that solution. So our solution catches fraudsters in the Los Angeles County daycare voucher program, but the techniques that enabled it were big data and graph analytics.

Ron Powell: You mentioned risk areas, and you're big with the credit card companies. What are you doing with them from a big data perspective?
Paul Kent: Absolutely. About 40% of our business comes from financial services and credit cards – the retail side of the business, where they're trying to evaluate the likelihood of a customer defaulting on his card. But on the other side of the coin, internally, banks need to evaluate the risk exposure of all the investments in their portfolios. How do they model and simulate future market states to assess how risky their exposure is across all the contracts and counter-contracts they have with the rest of the financial community? That calls for a big data approach these days. You need to do many, many calculations, and you need to run stress tests against those calculations. Once you've reached a predicted state, you need to spend some time with it and ask: what if interest rates don't vary like that, but like this? What if we're more sensitive to weather events going forward than we were in the past? Things like that test the sensitivity of their position.

Ron Powell: Thank you, Paul, for providing BeyeNETWORK readers with this update on big data analytics at SAS.