The practice of predictive analytics has come a long way since the advent of operations research. Current predictive models routinely use more than 100 variables, most of which are characteristics of the object being analyzed, such as age and gender for customer analysis, or color and text for advertisements. In this article, we will explore how graph methods can be used in predictive analytics.
Graph methods, as used here, are techniques that analyze a network of objects; the term does not refer to charts such as pie charts and trend lines. Networks are made up of nodes and the connections among them. Nodes are also called vertices, and connections are also called edges or arcs. Graph methods analyze the link structure, the rate of diffusion and the clustering of nodes in a network. Many concepts covered in the popular press, such as Google’s PageRank, social network analysis, influencer marketing and driving directions, rely on graph-theoretic methods.
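To make the vocabulary concrete, here is a minimal Python sketch of a network stored as nodes and edges, with two simple measurements: degree (how many connections a node has) and the local clustering coefficient (how many of a node's neighbors are connected to each other). The names and the toy edge list are illustrative assumptions, not data from any study mentioned in this article.

```python
from itertools import combinations


def build_graph(edges):
    """Build an undirected graph as a dict mapping node -> set of neighbors."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def local_clustering(graph, node):
    """Fraction of pairs of a node's neighbors that are themselves connected."""
    neighbors = graph[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # fewer than two neighbors: no pairs to check
    linked = sum(1 for u, v in combinations(neighbors, 2) if v in graph[u])
    return linked / (k * (k - 1) / 2)


# Hypothetical toy network: each pair is one connection (edge) between two nodes.
edges = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"), ("cat", "dan")]
graph = build_graph(edges)

degree = {node: len(neighbors) for node, neighbors in graph.items()}
clustering = {node: local_clustering(graph, node) for node in graph}
```

In this toy network, "cat" has the most connections, while "ann" and "bob" sit in a fully connected triangle. Per-node numbers like these are the kind of graph-derived values that can be appended to a conventional table of predictors.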
Real-World Use Cases
Why are these graph methods useful for predictive analytics? One reason is that additional contextual predictors can be derived using graph methods. Let’s start with a well-known example: Google uses PageRank as one of many predictors for the relevance of a web page. The link structure of the World Wide Web provides valuable contextual information about which pages are deemed most relevant by web page creators, and this contextual link structure is then used to predict relevance for a user’s query.
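The idea of scoring pages from link structure can be sketched with the textbook power-iteration form of PageRank. This is a simplified illustration, not Google's production algorithm. The damping factor of 0.85, the iteration count and the four-page toy web are assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank.

    links: dict mapping each page to the list of pages it links to.
    Returns a dict mapping page -> score; the scores sum to 1.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web of four pages.
links = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "orphan": ["home"],
}
scores = pagerank(links)
```

Here "home" ends up with the highest score because every other page links to it, and "orphan" ends up with the lowest because nothing links to it. The resulting score per page is a single numeric column that can sit alongside other predictors.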
Another example: If you are trying to predict the revenue contribution of each employee, then in addition to educational level, job responsibility and experience, you might consider how well the employee relates to customers or other managers. The contextual information added here is not something inherent to the employee. IBM conducted such a study with 2,600 of its consultants, and created a model with revenue as the target variable and three sets of predictor variables.[1]
In addition to the standard employee and job-related characteristics, two sets of contextual variables were added. One was related to the structure of the web of relationships, and the other was related to how each employee was related to managers and customers. It turned out that graph-derived contextual predictors were highly correlated with revenue.
Enriching Context Where It Seemingly Did Not Exist
One common question is how much of the contextual information is already captured in the demographic data. After all, it is commonly understood that birds of a feather flock together. To tackle this question, a group of researchers at MIT studied how the Yahoo instant messenger network affected the adoption of the Yahoo Go product.[2]
They discovered that only 50% of the contextual network effect could be explained by the common underlying demographic variables.
This finding has big implications for predictive analytics in data-poor environments, which benefit greatly from contextual predictors built from data you already have. For example, if you are trying to predict churn in a prepaid mobile customer base where there is little to no demographic information, using social network contextual information can greatly improve prediction accuracy and lower the false positive rate.
Another type of contextual network information that can be included in predictive analytics is the structure of the social community that one belongs to. In the Yahoo study above, which tried to predict product adoption, the researchers found that your chance of adopting a product rises as more people in your social community adopt it.
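One simple way to turn this observation into a predictor is to compute, for each user, the share of direct contacts who have already adopted the product. The sketch below is an assumption about how such a feature might be built. It is not the method used in the Yahoo study, and the user IDs are made up.

```python
def adopter_fraction(contacts, adopters):
    """For each user, the fraction of direct contacts who already adopted.

    contacts: dict mapping user -> set of that user's contacts.
    adopters: set of users who have already adopted the product.
    """
    features = {}
    for user, friends in contacts.items():
        if friends:
            adopted = sum(1 for friend in friends if friend in adopters)
            features[user] = adopted / len(friends)
        else:
            features[user] = 0.0  # no contacts, so no social signal
    return features


# Hypothetical contact lists and adopters.
contacts = {
    "u1": {"u2", "u3", "u4"},
    "u2": {"u1"},
    "u3": {"u1"},
    "u4": {"u1"},
    "u5": set(),
}
adopters = {"u2", "u3"}
features = adopter_fraction(contacts, adopters)
```

User "u1" has two adopters among three contacts, so the feature value is two thirds. In a production setting the same calculation would be restricted to adoptions that happened before the prediction date, so that the predictor does not leak future information.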
Besides adding social contextual network data to predictive functions, you can also deduce contextual information. For example, in trying to predict fraud from a set of credit applications, we can construct a network of information attributes. Using the basic idea that people with bad credit tend to be related to other similar people, some researchers generated graph-based hub and authority variables from such a network and added them to a support vector machine prediction function.[3]
The graph-derived context variables improved the fraud prediction by approximately 20%.
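Hub and authority scores come from the HITS link-analysis algorithm. The sketch below shows the standard iteration on a small directed network. How the cited study actually built its network is not described here, so the applicant-to-attribute edges (shared phone numbers and addresses) are purely an illustrative assumption, as are all names and values.

```python
import math


def _normalize(scores):
    """Scale scores so that their squares sum to 1 (no-op if all are zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {node: v / norm for node, v in scores.items()}


def hits(edges, iterations=50):
    """Standard HITS iteration on a list of directed (source, target) edges.

    Returns (hub, authority) dicts. A good hub points to good authorities;
    a good authority is pointed to by good hubs.
    """
    nodes = {node for edge in edges for node in edge}
    hub = {node: 1.0 for node in nodes}
    auth = {node: 0.0 for node in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of nodes pointing in.
        auth = {node: 0.0 for node in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        auth = _normalize(auth)
        # Hub update: sum the authority scores of nodes pointed to.
        hub = {node: 0.0 for node in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        hub = _normalize(hub)
    return hub, auth


# Hypothetical edges from credit applications to the attributes they list.
edges = [
    ("app1", "phone:555-0100"),
    ("app2", "phone:555-0100"),
    ("app3", "phone:555-0100"),
    ("app3", "addr:12 Elm St"),
    ("app4", "addr:9 Oak Ave"),
]
hub, auth = hits(edges)
```

The phone number shared by three applications receives the highest authority score, and "app3", which touches two shared attributes, receives a higher hub score than "app1". Scores like these could then be added as extra input columns to a classifier such as a support vector machine.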
Clearly, the benefits of adding graph-based predictors are better lift and lower rates of false positives. The improvements range from 20% to 100%, depending on the richness of the data available today. These benefits translate directly into bottom-line results to the extent that your company is already putting predictive analytics into daily operations. So what are the practical aspects of putting this into operation?
How Do You Get Started?
Let’s start with the data: Where do you obtain the data to construct such contextual networks? The IBM study looked at the anonymized email traffic, address books and buddy lists. If you were a retailer like Amazon, you could use wish lists, product gift transactions and share-the-love data to construct the social context for predictions. Financial services companies can use the trading data to construct the contextual network among the traders or credit application data to construct an information network. Telecommunications companies can leverage the call detail records. In fact, the explosion of data collection means that most organizations already have this contextual network information just waiting to be accessed and utilized.
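As an illustration of how raw transaction data becomes a contextual network, the sketch below aggregates call detail records into weighted connections between subscribers. The record layout of (caller, callee, seconds) is an assumption, since real call detail records carry many more fields, and the subscriber IDs are made up.

```python
from collections import defaultdict


def build_call_network(records):
    """Aggregate call records into weighted, undirected connections.

    records: iterable of (caller, callee, seconds) tuples.
    Returns a dict mapping a sorted (subscriber, subscriber) pair to
    the number of calls and total seconds between the two.
    """
    weights = defaultdict(lambda: {"calls": 0, "seconds": 0})
    for caller, callee, seconds in records:
        if caller == callee:
            continue  # skip self-calls, which carry no relationship information
        pair = tuple(sorted((caller, callee)))  # A->B and B->A are the same tie
        weights[pair]["calls"] += 1
        weights[pair]["seconds"] += seconds
    return dict(weights)


# Hypothetical call detail records.
records = [
    ("A", "B", 60),
    ("B", "A", 30),
    ("A", "C", 10),
    ("C", "C", 5),
]
network = build_call_network(records)
```

The two calls between "A" and "B" collapse into a single connection with a call count of 2 and a total of 90 seconds. The same pattern applies to email traffic, gift transactions or trades: pick the two parties in each record and accumulate a weight for the pair.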
One unique aspect of contextual network predictors is that they are usually specific to the target outcome you want to predict. That is, after all, what contextual means. So if you want to predict churn, you need to understand how the contextual network predisposes a subscriber to churn. Such churn insight will certainly differ from what you would need to predict employee revenue or product adoption.
Since the contextual environment is changing, the window that an enterprise has to leverage the additional predictive power of contextual networks is also dynamic. Most customers will switch vendors very soon after they learn about a bad experience from a friend, so vendors must respond in a timely manner. Practically, this means that the calculation of graph-based predictors must occur more frequently.
Some contextual network information, however, is relatively stable over time. For example, if someone has a preferred takeout restaurant, she will likely continue to call that restaurant even after she gets a new phone number. These “creatures of habit” can be more easily identified, say in the case of fraud or terrorism prevention.
One potential impediment to leveraging contextual network information involves privacy and data security. Since contextual networks tend to involve personal transactions, companies need to ensure the highest level of security to prevent breaches. Regulatory bodies at the state, country or global level might also have conflicting guidelines on what information is considered private and cannot be used for data analytics.
Another obstacle today is scalability. These contextual networks are often many times the size of the data currently used for analysis. Because contextual networks rely on relationship information, sampling runs the risk of losing valuable information. James Kobielus of Forrester postulates that the advent of social network analysis will catalyze the need for petabyte-sized data warehouses.[4]
Another obstacle is the lack of software tools that generate these graph-based predictors. Specialized graph-based calculations include community detection, diffusion simulation, vertex similarity, and topology and centrality analysis. Open source tools such as JUNG and Pajek work well for R&D, while proven commercial tools such as Sonamine are geared toward large-scale production graph mining.
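To give a flavor of one of these calculations, the sketch below computes a common form of vertex similarity: the Jaccard overlap between the neighbor sets of two nodes. It is a small illustration under assumed data, not the implementation used by any of the tools named above.

```python
def jaccard_similarity(graph, a, b):
    """Jaccard overlap between the neighbor sets of nodes a and b.

    graph: dict mapping node -> set of neighbors.
    Returns a value between 0.0 (no shared neighbors) and 1.0 (identical neighbors).
    """
    neighbors_a = graph.get(a, set())
    neighbors_b = graph.get(b, set())
    union = neighbors_a | neighbors_b
    if not union:
        return 0.0  # neither node has any neighbors
    return len(neighbors_a & neighbors_b) / len(union)


# Hypothetical toy network.
graph = {
    "ann": {"bob", "cat", "dan"},
    "eve": {"bob", "cat"},
    "bob": {"ann", "eve"},
    "cat": {"ann", "eve"},
    "dan": {"ann"},
}
similarity = jaccard_similarity(graph, "ann", "eve")
```

"ann" and "eve" share two of the three distinct neighbors between them, so their similarity is two thirds even though they are not directly connected. At production scale the number of node pairs grows quickly, which is part of why specialized tooling matters.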
A last obstacle is education. Contextual network predictors are unlike other types of data being captured and analyzed. Enterprises must first capture this new data, and that will require a shift in mindset. Setting up an infrastructure to model the contextual network will require new tools and approaches. Finally, using these contextual networks in predictive analytics to improve the bottom line will require the cooperation of multiple disciplines: analytics, marketing and operations.
Despite these obstacles, the time to investigate and leverage these graph-based methods is now. IDAnalytics, a company that provides identity theft protection, has been issued a patent for a system and method for identity-based fraud detection through graph anomaly detection. How will you start adding graph-derived predictors to your predictive analytics?
References:
- [2] Sinan Aral et al. Distinguishing influence-based contagion from homophily-driven diffusion in dynamic networks. PNAS, December 22, 2009, vol. 106, no. 51, 21544–21549
- [3] Xu Xiujuan et al. Credit scoring algorithm based on link analysis ranking with support vector machine. Expert Systems with Applications 36 (2009) 2625–2632