(Note: This is the sixth and final article in a series on advanced analytics.)
Model-making is at the heart of advanced analytics. Thankfully, few of us need to create analytical models or learn the statistical techniques upon which they're based. However, any self-respecting business intelligence (BI) professional needs to understand the modeling process so he can better support the data requirements of analytical modelers.
An analytical model is simply a mathematical equation that describes relationships among variables in a historical data set. The equation either estimates or classifies data values. In essence, a model draws a "line" through a set of data points that can be used to predict outcomes. For example, a linear regression draws a straight line through data points on a scatterplot that shows the impact of advertising spend on sales for various ad campaigns. The model's formula--in this case, "Sales=17.813 + (.0897* advertising spend)"-- enables executives to accurately estimate sales if they spend a specific amount on advertising. (See figure 1.)
Algorithms that create analytical models (or equations) come in all shapes and sizes. Classification algorithms, such as neural networks, decision trees, clustering, and logistic regression, use a variety of techniques to create formulas that segregate data values into groups. Online retailers often use these algorithms to create target market segments or determine which products to recommend to buyers based on their past and current purchases. (See figure 2.)
Trusting Models. Unfortunately, some models are more opaque than others; that is, it's hard to understand the logic the model used to identify relevant patterns and relationships in the data. The problem with these "black box" models is that business people often have a hard time trusting them until they see quantitative results, such as reduced costs or higher revenues. Getting business users to understand and trust the output of analytical models is perhaps the biggest challenge in data mining.
To earn trust, analytical models have to validate a business person's intuitive understanding of how the business operates. In reality, most models don't uncover brand new insights; rather they unearth relationships that people understand as true but aren't looking at or acting upon. The models simply refocus people's attention on what is important and true and dispel assumptions (whether conscious or unconscious) that aren't valid.
Given the power of analytical models, it's important that analytical modelers take a disciplined approach. Analytical modelers need to adhere to a methodology to work productively and generate accurate models. The modeling process consists of six distinct tasks:
- Define the project
- Explore the data
- Prepare the data
- Create the model
- Deploy the model
- Manage the model
Interestingly, preparing the data is the most time-consuming part of the process, and if not done right, can torpedo the analytical model and project. "[Data preparation] can easily be the difference between success and failure, between usable insights and incomprehensible murk, between worthwhile predictions and useless guesses," writes Dorian Pyle in his book, "Data Preparation for Data Mining."
Figure 3 shows a breakdown of the time required for each of these six steps. Data preparation consumes one-quarter (25%) of an analytical modeler's time, followed by model creation (23%), data exploration (18%), project definition (13%), scoring and deployment (12%), and model management (9%). Thus, almost half of an analytical modelers' time (43%) is spent exploring and preparing data, although this varies based on the condition and availability of data. Analytical modelers are like house painters who must spend lots of time preparing a paint surface to ensure a long-lasting paint finish.
Figure 3. Analytical Modeling Tasks
From Wayne Eckerson, "Predictive Analytics: Extending the Value of Your Data Warehousing Investment," 2007. Based on 166 respondents who have a predictive modeling practice.
Project Definition. Although defining an analytical project doesn't take as long as some of the other steps, it's the most critical task in the process. Modelers that don't know explicitly what they're trying to accomplish won't be able to create useful analytical models. Thus, before they start, good analytical modelers spend a lot of time defining objectives, impact, and scope.
Project objectives consist of the assumptions or hypotheses that a model will evaluate. Often, it helps to brainstorm hypotheses and then prioritize them based on business requirements. Project impact defines the model output (e.g., a report, a chart, or scoring program), how the business will use that output (e.g., embedded in a daily sales report or operational application or used in strategic planning), and the projected return on investment. Project scope defines who, what, where, when, why, and how of the project, including timelines and staff assignments.
For example, a project objective might be: "Reduce the amount of false positives when scanning credit card transactions for fraud." While the output might be: "A computer model capable of running on a server and measuring 7,000 transaction per minute, scoring each with probability and confidence, and routing transactions above a certain threshold to an operator for manual intervention."
Data Exploration. Data exploration or data discovery involves sifting through various sources of data to find the data sets that best fit the project. During this phase, the analytical modeler will document each potential data set with the following items:
- Access methods: Source systems, data interfaces, machine formats (e.g. ASCII or EBCDIC), access rights, and data availability.
- Data characteristics: Field names, field lengths, content, format, granularity and statistics (e.g. counts, mean, mode, median, and min/max values)
- Business rules: Referential integrity rules, defaults, other business rules
- Data pollution: Data entry errors, misused fields, bogus data
- Data completeness: Empty or missing values, sparsity
- Data consistency: Labels and definitions
Typically, an analytical modeler will compile all this information into a document and use it to help prioritize which data sets to use for which variables. (See figure 4.) A data warehouse with well documented metadata can greatly accelerate the data exploration phase because it also maintains much of this information. However, analytical modelers often want to explore external data and other data sets that don't exist in the data warehouse and must compile this information manually.
Data Preparation. Once analytical modelers document and select their data sets, they then must standardize and enrich the data. First, this means correcting any data errors that exist in the data and standardizing the machine format (e.g. ASCII vs EBCDIC). Then, it involves merging and flattening the data into a single wide table which may consist of hundreds of variables (i.e., columns). Finally, it means enriching the data with third party data, such as demographic, psychographic, or behavioral data that can enhance the models.
From there, analytical modelers transform the data so it's in an optimal form to address project objectives and meet processing requirements for specific machine learning techniques. Common transformations include summarizing data using reverse pivoting(See figure 5), transforming categorical values into numerical values, normalizing numeric values so they range from 0 to 1, consolidating continuous data into a finite set of bins or categories, removing redundant variables, and filling in missing values.
Modelers try to eliminate variables and values that aren't relevant as well as fill in empty fields with estimated or default values. In some cases, modelers may want to increase the bias or skew in a data set by duplicating outliers, giving them more weight in the model output. These are just some of the many data preparation techniques that analytical modelers use.
Figure 5. Reverse Pivoting
To model a banking "customer" not bank transactions, analytical modelers use a technique called reverse pivoting to summarize banking transactions to show customer activity by period.
Analytical Modeling. Analytical modeling is as much art as science. Much of the craft involves knowing what data sets and variables to select and how to format and transform the data for specific data models. Often, a modeler will start with 100+ variables and then, through data transformation and experimentation, winnow them down to 12 to 20 variables that are most predictive of the desired outcome.
In addition, an analytical modeler needs to select historical data that has enough of the "answers" built in it with a minimal amount of noise. Noise consists of patterns and relationships that have no business value, such as a person's birth date and age, which gives a 100 percent correlation. A data modeler will eliminate one of those variable to reduce noise. In addition, they will validate their models by testing them against random subsets of the data which they set aside in advance. If the scores remain compatible across training, testing, and validation data sets then they know they have a fairly accurate and relevant model.
Finally, the modeler must choose the right analytical techniques and algorithms or combinations of techniques to apply to a given hypothesis. This is where modelers' knowledge of business processes, project objectives, corporate data, and analytical techniques come into play. They may need to try many combinations of variables and techniques before they generate a model with sufficient predictive value.
Every analytical technique and algorithm has its strengths and weaknesses, as summarized in the tables below. The goal is to pick the right modeling technique so you have to do as little preparation and transformation as possible, according to Michael Berry and Gordon Linhoff in their book, "Data Mining Techniques: For Marketing, Sales, and Customer Support."
Table 1. Analytical Models
Table 2. Analytical Techniques
Deploy the Model. Model deployment takes many forms, as mentioned above. Executives can simply look at the model, absorb its insights, and use it to guide their strategic or operational planning. But models can also be operationalized. The most basic way to do operationalize a model is to embed it in an operational report. For example, a daily sales report for a telecommunications company might list each sales representative's customers by their propensity to churn. Or a model might be applied at the point of customer interaction, whether at a branch office or at an online checkout counter.
To apply models, you first have to score all the relevant records in your database. This involves converting the model into SQL or some other program that can run inside the database that holds the records that you want to score. Scoring involves running the model against each record and generating a numeric value, usually between 0 and 1, which is then appended to the record as an additional column. A higher score generally means a higher propensity to portray the desired or predicted behavior. Scoring is usually a batch process that happens at night or on the weekend depending on the volume of records that need to be scored. However, scoring can also happen in real-time, which is essentially what online retailers do when they make real-time recommendations based on purchases a customer just made.
Model Management. Once the model is built and deployed, it must be maintained. Models become obsolete over time, as the market or environment in which they operate changes. This is particularly true for volatile environments, such as customer marketing or risk management. Also, complex models that deliver high business value usually require a team of people to create, modify, update, and certify the models.
In such an environment, it's critical to have a model repository that can track versions, audit usage, and manage a model through its lifecycle. Once an organization has more than one operational model, it's imperative it implements model management utilities, which most data mining vendors now support.
Analytical models can be powerful. They can help organizations use information proactively instead of reactively. They can make predictions that streamline business processes, reduce costs, increase revenues, and improve customer satisfaction.
To create analytical models is as much art as science. A well-trained modeler needs to step through a variety of data-oriented tasks to create accurate models. Much of the heavy lifting involved in creating analytical models involves exploring and preparing the data. A well designed data warehouse or data mart can accelerate the modeling process by collecting and documenting a large portion of the data that modelers require and transforming that data into wide, flat tables conducive to the modeling process.
Posted November 29, 2011 1:42 PM
Permalink | No Comments |