Straightforward Analytics: Predictive Analytics, Correlation, Predisposition, and Some Risks

Originally published August 30, 2012

What does it mean for analytics to be “predictive”? For the most part, the term encompasses a growing body of methods and techniques that apply data mining, machine learning, and other statistical methods to collections of historical events and transactions, with the aim of gaining insight that helps predict the future. Often, the prediction is expressed either as a predisposition to take some action or as a risk of some future occurrence.

As an example, the health care field looks to use predictive analytics to assess patient risks (such as the risk of developing certain conditions or diseases). Another example is credit scoring, used in the financial industry to predict whether an individual is more or less likely to default on a loan. In a retail sales process, a predictive model might be used to determine a customer’s likelihood of purchasing a particular product at a specific time.

There are many aspects to developing predictive models (which I intend to discuss in future columns), but most applications of these techniques share two characteristics:

  1. They focus on differentiating between the relevant independent variables (that is, variables whose values are available at the start of a process) and one or more dependent variables (whose values are essentially created as a result of the process).

  2. The modeling approach involves analyzing sample sets of data to look for correlations and patterns that can be used with future data sets for prediction.

These characteristics are interrelated. The objective of the analysis is to determine the set of independent variables presumed to exert the greatest influence on the dependent variable. To continue one of our earlier examples, a predictive model might rate a borrower’s risk of loan default, which would be the dependent variable. The next step is to look at the potential inputs (such as geo-demographic variables, annual income, amount of liquid assets, etc.) to see which ones are correlated with loan default and to what degree their values exert influence.
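
As a rough illustration of that screening step, the sketch below ranks a few candidate inputs by the strength of their Pearson correlation with loan default. The eight borrower records, the variable names, and the deliberately irrelevant `shoe_size` column are all invented for the example; a real model would use far more data and more careful statistics.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical independent variables for eight borrowers (invented values).
borrowers = {
    "annual_income": [25, 32, 41, 48, 55, 63, 72, 90],  # thousands
    "liquid_assets": [2, 1, 5, 3, 12, 9, 20, 30],       # thousands
    "shoe_size": [9, 11, 9, 10, 8, 10, 9, 10],          # irrelevant on purpose
}

# Dependent variable: 1 if the borrower defaulted, 0 otherwise (invented).
defaulted = [1, 1, 1, 0, 0, 0, 0, 0]

def rank_inputs(inputs, target):
    """Order candidate inputs by absolute correlation with the target."""
    scores = {name: pearson(values, target) for name, values in inputs.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

ranking = rank_inputs(borrowers, defaulted)
for name, r in ranking:
    print(f"{name:15s} r = {r:+.2f}")
```

With a sample this small, even the irrelevant column shows a nonzero correlation, which is one reason the cautions that follow matter.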

But before diving deep into the techniques and algorithms used for predictive analytics, consider the ways in which biases or risks may creep into the environment:

Correlation vs. causation – You must distinguish between correlation among two or more variables and causation. Correlation indicates a statistical dependence between variables; causation means that a change in one actually brings about a change in the other. Just because two variables are correlated does not imply that one caused the other. Assuming that correlation implies causality can lead to the presumption that taking one set of actions will result in certain decisions being made, which may not always be true.

Misunderstanding root causes of predisposition – This is somewhat of a variation on the previous theme, but it differs subtly because it tempts analysts to attribute intent that might be spurious. For example, I once attended a talk in which an analysis of sales activity at a theme park indicated that when it rained, sales increased. However, one cannot say that rain makes people want to spend more money. Rather, when it rained, people went inside stores to get out of the rain and ended up spending money because they were already in the store.
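
The theme park story can be mimicked with a small simulation. In the sketch below, rain changes only how many people are inside the stores, while the amount each visitor spends is drawn independently of the weather. Every number (the chance of rain, visitor counts, spending levels, the random seed) is an assumption made up for the illustration, not data from the talk.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def simulate_days(n_days, seed=0):
    """Simulate daily store sales under invented assumptions."""
    rng = random.Random(seed)
    rain, sales, spend_per_visitor = [], [], []
    for _ in range(n_days):
        raining = rng.random() < 0.3
        # Assumption: rain pushes roughly 150 extra people into the stores.
        visitors = max(1.0, 200 + (150 if raining else 0) + rng.gauss(0, 20))
        # Assumption: what each visitor spends does not depend on the weather.
        spend = max(0.0, rng.gauss(12, 1.5))
        rain.append(1.0 if raining else 0.0)
        spend_per_visitor.append(spend)
        sales.append(visitors * spend)
    return rain, sales, spend_per_visitor

rain, sales, spend_per_visitor = simulate_days(1000)
r_sales = pearson(rain, sales)              # rain vs. total store sales
r_spend = pearson(rain, spend_per_visitor)  # rain vs. willingness to spend

print(f"rain vs. total sales:       r = {r_sales:+.2f}")
print(f"rain vs. spend per visitor: r = {r_spend:+.2f}")
```

Total sales track the rain closely, yet spend per visitor does not, so the correlation reflects where people are standing rather than any change in their desire to spend.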

Influence and correlation as a limitation for presenting alternatives – Let’s say that your analysis has determined that in a majority of the decision-making situations, three of the choices are selected 80% of the time. This might suggest limiting the choices to those three, but that sets artificial limits on the decision makers by eliminating all the other choices.

Bias resulting from selected training data sets – Predictive models evolve, you might say, by analyzing training data and then applying the resulting model to new data. One risk is that biases present in the training data become integrated into the model but don’t hold true in the general case.
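
A deliberately tiny sketch of this risk, using the loan default example: the "model" here is nothing more than an observed default rate, and the training set is assumed to contain only existing high-income customers. The income bands and counts are invented for illustration.

```python
# Hypothetical population of (income_band, defaulted) records, invented counts.
population = (
    [("low", True)] * 30 + [("low", False)] * 70
    + [("high", True)] * 5 + [("high", False)] * 95
)

def default_rate(records):
    """Fraction of records in which the borrower defaulted."""
    return sum(1 for _, defaulted in records if defaulted) / len(records)

# Biased training set: only high-income borrowers were ever sampled.
training = [record for record in population if record[0] == "high"]

learned_rate = default_rate(training)  # what the "model" believes
true_rate = default_rate(population)   # what holds in the general case

print(f"rate learned from training data: {learned_rate:.3f}")
print(f"rate across the whole population: {true_rate:.3f}")
```

Applied to all applicants, the learned rate would understate default risk considerably, even though it describes the training data perfectly.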

Reliance on complex automated models – Recent news items regarding flaws in algorithmic trading applications leading to significant costs point to the risk of abdicating responsibility to predictive algorithms for decision making.

Flaws in the input assumptions – The model may operate absolutely perfectly as long as the input is of acceptable quality. But that same model may be acutely sensitive to incomplete or inaccurate data, which can also lead to drawing incorrect conclusions.
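
To make the sensitivity concrete, here is a toy scoring function with invented coefficients, scored once through a naive pipeline that silently replaces a missing income with zero and once through a wrapper that refuses incomplete records. The function names and the record layout are hypothetical.

```python
def default_risk_score(annual_income, liquid_assets):
    """Toy risk score in [0, 1]; higher means riskier. Coefficients are invented."""
    raw = 1.0 - (annual_income / 100_000) * 0.6 - (liquid_assets / 50_000) * 0.4
    return min(1.0, max(0.0, raw))

def naive_score(record):
    """Silently treats a missing value as zero, as a careless pipeline might."""
    income = record.get("annual_income") or 0
    assets = record.get("liquid_assets") or 0
    return default_risk_score(income, assets)

def checked_score(record):
    """Refuses to score records with missing or negative inputs."""
    income = record.get("annual_income")
    assets = record.get("liquid_assets")
    if income is None or assets is None:
        raise ValueError("missing input; refusing to score")
    if income < 0 or assets < 0:
        raise ValueError("negative input; refusing to score")
    return default_risk_score(income, assets)

complete = {"annual_income": 80_000, "liquid_assets": 25_000}
# The same borrower after the income field was lost somewhere upstream.
incomplete = {"annual_income": None, "liquid_assets": 25_000}

print(f"complete record:          {checked_score(complete):.2f}")
print(f"incomplete, naive score:  {naive_score(incomplete):.2f}")
```

The model itself is unchanged between the two calls; only the quality of the input differs, and the naive path turns a low-risk borrower into an apparently high-risk one.
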

When the proper caution is added to the mix, predictive analytics can provide a significant lift, not just in predicting behavior but ultimately in influencing it. Yet one must be aware of potential barriers to usability when misguided assumptions are made about the expected results. In future articles I will look at more details of predictive analytics and at how the models are developed and put to use.

Recent articles by David Loshin


