Predictive Analytics: Benefits and Challenges of Using Graph Theoretic Methods Leveraging Contextual Network Information for Better Accuracy

Originally published September 2, 2010

The practice of predictive analytics has come a long way since the advent of operations research. Current predictive models routinely use more than 100 variables, most of which are characteristics of the object being analyzed: age and gender for customer analysis, for example, or color and text for advertisements. In this article, we will explore how graph methods can be used in predictive analytics.

Graph methods refer to techniques that analyze a network of objects, rather than pie charts and trend lines. Networks are made up of nodes (also called vertices) and the connections among them (also called edges or arcs). Graph methods analyze the link structure, rate of diffusion and clustering of nodes in a network. Many concepts in the popular press, such as Google’s PageRank, social network analysis, influencer marketing and driving directions, all rely on graph theoretic methods.

Real World Use Cases

Why are these graph methods useful for predictive analytics? One reason is that additional contextual predictors can be derived using graph methods. Let’s start with a well-known example: Google uses PageRank as one of many predictors of the relevance of a web page. The link structure of the World Wide Web provides valuable contextual information about which pages are deemed most relevant by web page creators; this contextual link structure is then used to predict relevance for a user’s query.
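As a minimal sketch of this idea, PageRank over a tiny link graph can be computed with the open-source networkx library. The library choice and the page names are mine for illustration; the article itself does not use them.

```python
import networkx as nx

# A tiny directed link graph: an edge u -> v means page u links to page v.
G = nx.DiGraph()
G.add_edges_from([
    ("home", "products"),
    ("home", "blog"),
    ("blog", "products"),
    ("products", "home"),
])

# PageRank scores each page purely from the link structure around it,
# not from any attribute of the page's own content.
scores = nx.pagerank(G, alpha=0.85)
print(sorted(scores, key=scores.get, reverse=True))
```

Here "products," which collects links from two other pages, ends up with the highest score: relevance is inferred from context rather than from the page itself.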

Another example: If you are trying to predict the revenue contribution of each employee, then in addition to educational level, job responsibility and experience, you might consider how well the employee relates to customers or other managers. The contextual information added here is not something inherent to the employee. IBM conducted such a study with 2,600 of its consultants, building a model with revenue as the outcome variable and three sets of predictor variables.[1] In addition to the standard employee and job-related characteristics, two sets of contextual variables were added: one related to the structure of the web of relationships, and the other to how each employee was connected to managers and customers. It turned out that the graph-derived contextual predictors were highly correlated with revenue.
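A sketch of how such relationship-structure predictors might be derived and joined to standard attributes, using networkx over entirely hypothetical consultants and email data (the IBM study's actual features are not spelled out here):

```python
import networkx as nx

# Hypothetical communication graph: an edge means two consultants email each other.
emails = [("ann", "bob"), ("ann", "carol"), ("bob", "carol"), ("carol", "dan")]
G = nx.Graph(emails)

# Standard (non-contextual) attributes for each consultant.
rows = {
    "ann":   {"years_experience": 4},
    "bob":   {"years_experience": 7},
    "carol": {"years_experience": 2},
    "dan":   {"years_experience": 9},
}

# Graph-derived contextual predictors: how connected each person is,
# and how much of a bridge each person forms in the web of relationships.
degree = nx.degree_centrality(G)
between = nx.betweenness_centrality(G)
for name, feats in rows.items():
    feats["degree_centrality"] = degree[name]
    feats["betweenness_centrality"] = between[name]
```

The two centrality columns would then be fed into the revenue model alongside the standard attributes; note that carol, who bridges dan to everyone else, scores highest on betweenness despite having the least experience.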

Enriching Context Where It Seemingly Did Not Exist

One common question raised is how much of the contextual information is already captured in the demographic data. After all, there is a common understanding that birds of a feather flock together. To tackle this question, a group of researchers at MIT studied how the Yahoo instant messenger network affected adoption of the Yahoo Go product.[2] They discovered that only 50% of the contextual network effect could be explained by the common underlying demographic variables.

This finding has big implications for predictive analytics in data-poor environments, which benefit greatly from contextual predictors derived from data you already have. For example, if you are trying to predict churn in a prepaid mobile customer base with little to no demographic information, using social network contextual information can greatly improve prediction accuracy and lower the false positive rate.

Another type of contextual network information that can be included in predictive analytics is the underlying form of the social community one belongs to. In the Yahoo study above, which tried to predict product adoption, the researchers found that your chances of adopting a product rise as more people in your social community adopt it.
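One simple contextual predictor of this kind is the fraction of a person's immediate network that has already adopted. A sketch over a hypothetical buddy graph (this is an illustration, not the Yahoo study's actual method):

```python
import networkx as nx

# Hypothetical instant-messenger buddy graph, plus the set of users
# who have already adopted the product.
G = nx.Graph([("u1", "u2"), ("u1", "u3"), ("u2", "u3"), ("u3", "u4"), ("u4", "u5")])
adopters = {"u1", "u2"}

def neighbor_adoption_rate(G, node, adopters):
    """Fraction of a user's immediate neighbors who have already adopted."""
    nbrs = list(G.neighbors(node))
    return sum(n in adopters for n in nbrs) / len(nbrs) if nbrs else 0.0

rates = {n: neighbor_adoption_rate(G, n, adopters) for n in G}
```

For user u3, two of three buddies have adopted, so the contextual predictor is 2/3; for u5, whose only buddy has not adopted, it is 0. This per-user rate becomes one more column in the adoption model.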

Besides social contextual network data that can be added to predictive functions, you can also deduce contextual information. For example, in trying to predict fraud from a set of credit applications, we can construct a network of information attributes. Using the basic idea that people with bad credit tend to be connected to other similar people, some researchers generated graph-based hub and authority variables from such a network and added them to a support vector machine prediction function.[3] The graph-derived context variables improved fraud prediction by approximately 20%.
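Hub and authority scores come from the HITS link-analysis algorithm. A sketch of deriving them with networkx over a hypothetical directed graph of credit applications (the cited paper's actual network construction differs in its details):

```python
import networkx as nx

# Hypothetical directed graph over credit applications: an edge a -> b means
# application a shares an identifying attribute (address, phone) with b.
G = nx.DiGraph([("a1", "a2"), ("a3", "a2"), ("a2", "a4"), ("a1", "a4")])

# HITS gives each node a hub score (it points at many well-linked records)
# and an authority score (many hubs point at it). Both scores become extra
# predictor columns alongside each application's own fields before fitting
# a classifier such as an SVM.
hubs, authorities = nx.hits(G)
```

Here a2, referenced by two other applications, gets a high authority score, while a1, which points at both well-referenced records, gets a high hub score.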

Clearly, the benefits of adding graph-based predictors are better lift and lower false positive rates. The improvements range from 20% to 100%, depending on the richness of the available data. These benefits translate directly into bottom-line results to the extent that your company already puts predictive analytics into daily operations. So what are the practical aspects of putting this into operation?

How Do You Get Started?

Let’s start with the data: Where do you obtain the data to construct such contextual networks? The IBM study looked at anonymized email traffic, address books and buddy lists. If you were a retailer like Amazon, you could use wish lists, product gift transactions and share-the-love data to construct the social context for predictions. Financial services companies can use trading data to construct the contextual network among traders, or credit application data to construct an information network. Telecommunications companies can leverage call detail records. In fact, the explosion of data collection means that most organizations already have this contextual network information just waiting to be accessed and utilized.
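Constructing such a network from call detail records can be as simple as aggregating calls into weighted edges. A sketch with hypothetical records and networkx:

```python
import networkx as nx

# Hypothetical call detail records: (caller, callee, duration in seconds).
cdrs = [
    ("555-0101", "555-0102", 120),
    ("555-0101", "555-0102", 45),
    ("555-0103", "555-0101", 300),
]

# Aggregate repeated calls between the same pair into one weighted edge:
# total talk time becomes the strength of the tie.
G = nx.Graph()
for caller, callee, secs in cdrs:
    if G.has_edge(caller, callee):
        G[caller][callee]["weight"] += secs
    else:
        G.add_edge(caller, callee, weight=secs)
```

The resulting weighted graph is the raw material for every graph-derived predictor discussed above, from centrality to community membership.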

One unique aspect of contextual network predictors is that they are usually specific to the target outcome you want to predict. That is, after all, what makes them contextual. So if you want to predict churn, you need to understand how the contextual network predisposes a subscriber to churn, and that churn insight will certainly differ from what you would need to predict employee revenue or product adoption.

Since the contextual environment changes, the window an enterprise has to leverage the additional predictive power of contextual networks is also dynamic. Most customers will switch vendors very soon after they hear about a bad experience from a friend, so vendors must respond in a timely manner. Practically, this means that graph-based predictors must be recalculated frequently.

Some contextual network information is relatively stable over time. For example, if someone has a preferred takeout restaurant, she will likely continue to call that restaurant even after she gets a new phone number. These “creatures of habit” can be more easily identified, say, in fraud or terrorism prevention.

Challenges

One potential impediment to leveraging contextual network information is privacy and data security. Since contextual networks tend to involve personal transactions, companies must ensure strong security to prevent breaches. Regulatory bodies at the state, national and global levels may also have conflicting guidelines on what information is considered private and cannot be used for data analytics.

Another obstacle today is scalability. These contextual networks are often many times the size of the data currently used for analysis, and because they rely on relationship information, sampling runs the risk of losing valuable information. James Kobielus of Forrester postulates that the advent of social network analysis will catalyze the need for petabyte-sized data warehouses.[4]

Another obstacle is the scarcity of software tools that generate these graph-based predictors. Specialized graph-based calculations include community detection, diffusion simulation, vertex similarity, topology and centrality analysis. Open source tools such as JUNG and Pajek work well for R&D, while commercial tools such as Sonamine are geared toward large-scale, production graph mining.
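As one illustration of these calculations, community detection on a toy graph can be sketched with the open-source networkx library (a substitution on my part; the tools named above are JUNG, Pajek and Sonamine):

```python
import networkx as nx
from networkx.algorithms import community

# Two tight clusters of three nodes each, joined by a single bridge edge.
G = nx.Graph([
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("x", "y"), ("y", "z"), ("x", "z"),
    ("c", "x"),
])

# Greedy modularity maximization recovers the two underlying communities.
comms = community.greedy_modularity_communities(G)
```

Community membership then becomes another contextual predictor, for example a flag for whether a subscriber belongs to a community with high churn.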

A last obstacle is education. Contextual network predictors are unlike other types of data being captured and analyzed. Enterprises must first capture this new data, which will require a mind-set shift. Setting up an infrastructure to model the contextual network will require new tools and approaches. Finally, using these contextual networks in predictive analytics that improve the bottom line will require the cooperation of multiple disciplines: analytics, marketing and operations.

Despite these obstacles, the time to investigate and leverage graph-based methods is now. ID Analytics, a company that provides identity theft protection, has been issued a patent for a system and method for identity-based fraud detection through graph anomaly detection. How will you start adding graph-derived predictors to your predictive analytics?
 
References:
  1. http://www.businessweek.com/technology/content/apr2009/tc2009047_031301.htm
  2. Sinan Aral et al., “Distinguishing influence-based contagion from homophily-driven diffusion in dynamic networks,” PNAS, December 22, 2009, vol. 106, no. 51, 21544–21549.
  3. Xu Xiujuan et al., “Credit scoring algorithm based on link analysis ranking with support vector machine,” Expert Systems with Applications 36 (2009), 2625–2632.
  4. http://blogs.forrester.com/business_process/2010/01/social-network-analysis-the-fuse-igniting-enterprise-data-warehouse-growth-its-planet-petabyte-or-bu.html
 
 



  • Nick Lim
    Nick is CEO and Founder of Sonamine, LLC. Over the past 15 years, Nick has worked in analytic environments with large amounts of data. Leveraging his training at Harvard, he has led product and strategy teams at MicroStrategy, Enpocket and Nokia to build systems that predicted ad clicks and consumer behavior using terabytes of information.

    Sonamine has developed a scalable graph mining platform that processes hundreds of millions of nodes and billions of edges. The Sonamine platform is used by customers to improve their predictive algorithms in customer retention, marketing and risk management. Nick may be reached at: (617) 755-5952 or by email at: nick@sonamine.com.


 

Comments


Posted June 21, 2011 by dave fred

We started working with Idiro recently and are happy with the results so far. We have also worked with KXEN and, to some extent, IBM. Sonamine is not the only player in this emerging field; IBM and SAP have been sharpening the capabilities of their tools to accommodate more comprehensive predictive analytics, and KXEN is another player that has been helpful to many companies globally. Sonamine looks like a very amateur company among these. I know a few friends who tried to reach out to them and waited for days for a simple answer. They say it is not a reliable company at all. This article promotes the company because it is written by the company's owner. I just want to remind everyone that it is your data, and it is extremely important who you work with in this field. So make sure you window shop very carefully before making a decision. That is what we have experienced.


Posted September 2, 2010 by Neil Raden nraden@hiredbrains.com

Hi Nick,

Boy it's been ages since we talked!

I have a question. The cost of devising and implementing a sizable predictive model-based application, and its attendant connection to some sort of automated decision engine, is quite high. In the example you gave, isn't predicting churn based on a bad experience near someone else (graphically speaking) a pretty narrow application? It sounds good in theory, but is the cost (high) benefit (narrow and limited within the overall scope of the business) justifiable?

-Neil Raden

Hired Brains
