Understanding Real-World Use of Big Data and Analytics
by Ron Powell
Originally published August 22, 2017
Edd, how is big data transforming the way businesses operate today?
Edd Wilder-James: Big data represents a very large opportunity for businesses today. From what I’ve seen over the last fifteen years, it’s time to wake up to it. That opportunity is not just big data as a product category. I believe what lies underneath has the ability to drive value inside an organization with data. This is quite new. This is taking it out of the reporting and back-office world and into places where you can save cost, generate new value, innovate new products and deal with customers in a different way through data. So really when we say big data that way, we’re talking about leveraging data outside of the traditional sandbox of analytics and reporting. We’re looking at how it can be mixed with the business to create new insights and opportunity.
Robert, as vice president of analytics, you head up a very large data science R&D group. Could you tell us about what your group does?
Robert Stratton: The data science R&D team, an internal team at Neustar, is looking at how it can bring data science to improve our internal processes and also how we can develop new products and improve the existing products that we sell to clients. Primarily I’m in the marketing services area. We’re using data analytics to improve our products and client offerings.
Edd, how has data and analytics enhanced your overall effectiveness?
Edd Wilder-James: At Silicon Valley Data Science we work with a lot of companies, and clearly a lot of them understand that there is an opportunity with data to create new value and do new things. So we get involved and really help move things forward with them. Typically, the first step is to determine whether there is a business need. This is not being driven by someone saying, “Oh, we could do great things if we had more precise and more nuanced analytics.” We begin by looking at the problem at hand, such as that it takes 30 days too long to get a product to market, or that they want to add a new feature but it is not economically feasible. Then you address the engineering. You address the data science later on. Economically, with big data and features – machine learning and so on that we couldn’t do before – the business problem must be the first point of entry.
Edd, from an infrastructure perspective, what do you see that makes your clients successful?
Edd Wilder-James: One of the most important things is creating the infrastructure with the end use in mind. That is not as obvious as it sounds because a lot of vendors just try to get you on their platform – for example, build a data lake because everyone else has one. The end use of the analytics, the structure of the organization and the applications all filter back into the infrastructure you need to create. Certain parts of that are probably not up for huge debate – things like using Hadoop, for instance. There is a whole raft of tools that we need to look at when we’re trying to solve those problems. And it becomes a lot more about understanding the data in your organization than caring too much about infrastructure. Yes, it’s important, but actually understanding where data is, who owns the data, what it is used for and how deep it is, is equally important and probably neglected a lot by folks who come from the IT world and look at infrastructure first.
Robert, you have a very big infrastructure for handling analytics. Could you share with us how you put that infrastructure together and describe the key components?
Robert Stratton: The biggest change we’ve seen over the last five or ten years is that everything we’re looking at is essentially big data. Ten years ago, doing marketing analytics on a desktop machine was perfectly feasible. Now, everything we’re looking at needs to be stored in a distributed fashion. And, also, the analytics that we’re doing needs to be distributed as well. The modeling files that we’re using are considerably bigger than they were five or ten years ago. Distributed computing and big data are probably the main things we have in mind when we’re looking at any new analytic solution. Things like Hadoop are still very core to what we’re doing, as is Spark – essentially any analytical tools that we can parallelize and use to handle big data analytics.
The number of data sources and the volume of data are significant. How are you able to control that, Robert?
Robert Stratton: The data is actually quite highly structured, particularly with advertising event logs and web visits. There is a lot of structured data. On the other hand, some of our products still use a form of unstructured data. That could be things coming through from, for example, an Excel file that a client provides, or a CSV that gets output by a website. It is really a combination. In some of our applications, we have a large amount of structured data, and in others there is still some amount of ad hoc processing that we need to do to bring in the data. We’re looking at other tools to help with that ingestion.
You mentioned Spark and Hadoop from the data storage side of the world. How are the people within your organization – the data scientists – able to access it readily?
Robert Stratton: There is a range of data sources that we use. Some are just event logs that get sent to us on an FTP site every day, and we can just pick them up and read them in without much trouble. Some of the other data that we’re using arrives by email. Some of it gets passed to us on an ad hoc basis. It is a continuum of very automated and semi-manual work. From a data science point of view, we want to know as much as possible about the data-generation process. So we need to understand what system created the data, what real-world process generated it, what data was recorded and what data was not recorded in the logs that we’re looking at. And we’re quite highly interested in causal analytics, so what were the things that led to a particular business outcome. We really need to understand all of the data and how it is generated to get to the results that are useful for our clients.
Are you using any products for cataloging or for querying the data?
Robert Stratton: We’ve been using Alation quite heavily, particularly for the last few months while we’ve been looking at a couple of new data sources that have been coming in and exploring some of the other data assets that the company has. Alation has proved useful to us, both as a cataloging tool and a querying tool. And Alation also helps us share the queries amongst us to know what table contains what data. We’ve been using Alation quite heavily for that recently. Once we know where the data is, doing the exploratory analysis on it and looking at hypotheses using the query tool is another way that we’ve been using Alation.
What are your analysts using from an infrastructure perspective?
Robert Stratton: We are using a combination of tools. We use H2O for some of our analysis, and we’re increasingly moving into Spark, using the analytical functionality. And, also, we’re writing our own analytical tools with Scala and the things that underlie it in the Spark interfaces. We still use a combination of R and Python at different parts in the process. Of course, it is difficult to parallelize those effectively, so a lot of the big data analytics we’re doing is still in distributed software like Spark and H2O.
Edd, you’re engaged with a lot of clients and customers. How have they addressed the resource demands for talent and analytics?
Edd Wilder-James: We don’t all need to hire the rock stars! We can build our platforms to enable people and reduce the amount of effort they have to put in to work with data. If you want to attract the best, you have to be one of the best companies to work inside. You have to have fun problems and data. You have to use tools that they get excited about. The hardest thing, for data scientists particularly, is that they’re curious about new problems. It’s all very well if they go and solve a problem and create the best system in the world, but if you don’t have a new set of problems for them to move on to, so they can continue to add value and continue to discover things, they’re going to want to move on. Attracting them is one thing. Retaining them is another. It helps to have interesting problems.
How are companies addressing the complexity with data ingestion from both an analytics perspective and an operational perspective?
Edd Wilder-James: Eighty percent of data science is getting and prepping the data. Even though everyone is getting better at that and there are tools for that, that’s always going to be the case. I’ve talked to a lot of CIOs, and for the last couple of years they haven’t needed to be sold on the value of analytics. They don’t need to be sold on the value of big data. The problem – how do we get the data out of databases and into somewhere central – started way earlier. It’s just a law of physics. Just as getting crude oil out of the ground is expensive, getting data out of databases is expensive. It’s always going to be that way. So we might try to make it easy, but if it’s easy to do, everybody is going to be doing it. Then where’s your competitive advantage? How do you choose the applications for that data that are the right ones to deliver value to the business straight away? Nobody is going to finance a 10-year project to unify all your databases for some random, unspecified cause just because it seems like a good idea. We’re not in this world anymore where we can expect to have 100 percent coverage of the data in our organization. You catalog 100 percent of the data in your organization, and Bob creates a spreadsheet. Now your catalog is out of date. Instead, we have to be pragmatic. We have to use tools like Alation for exploration of the data to help us google our own enterprises, and we need to choose the use cases and the datasets that are most useful for the next task at hand.
Robert, from a best practice perspective, what advice could you share about what you’re doing with analytics?
Robert Stratton: One of the main things for us is that really understanding the data and how it was created is still key. Because we’re looking at why things happen and how those things lead to business outcomes, I think the main thing that we still focus on is what the data means, how it came together, and how we should be interpreting it in order to do analytics with it. We’re probably less in the forefront of some of the very new classification methods, and we’re still more interested in inference and learning about business processes through the data that we’re seeing.
Edd, what best practice could you share about big data and analytics?
Edd Wilder-James: I think the best practice is realizing that it is not just a technical problem and that whenever you get involved with data – particularly in the context of the whole business – there are also organizational and political considerations that you have to take into account. You’re in the business of looking into processes that different people own, uncovering things they maybe don’t know about or are blindsided by, and you want to win them over, enabling them as you work with them. You need to understand that at least part of your data team has to be political to unify the business while working together with the technical side. So we’re in a great place for tools, but fundamentally we’re in an era where looking into data means looking into the very things the business does and the consequences. You want to be savvy as you go about it.
It’s been a pleasure talking with both of you to gain real-world insights into the challenges of big data and analytics.
Copyright 2004 — 2017. Powell Media, LLC. All rights reserved.
BeyeNETWORK™ is a trademark of Powell Media, LLC