Originally published June 11, 2009
Although randomly assigned test and control groups are considered the “gold standard” for measuring campaign performance, there are many designs without random assignment that are useful to marketers. In this article, we describe one such design, the Pre/Post and Test/Control group design, and contrast variable matching and propensity matching techniques to identify customers who can serve as control groups. We also describe a measure of “incremental sales” as the difference between “expected” and observed sales in the post-period, where “expected” sales are estimated by weighting the pre-period test group sales by the ratio of control group post-period sales to control group pre-period sales. We illustrate this design with the matched control group analysis in a case study where a retailer is interested in evaluating the impact of an in-store promotion.
Here, we will describe a method to evaluate the impact of promotions without randomly assigning customers to control groups. In Section 1, we discuss several alternative designs and measures of incremental sales – sales that are estimated to be due solely to the impact of a promotion. Section 2 describes the use of variable matching and the propensity score to find matched control groups.^{1} Section 3 illustrates estimating incremental sales using both methods to find control groups that can be used to evaluate the impact of an actual in-store promotion. Section 4 discusses the limitations of the approach and comparisons with other methods, ending with future directions for research.
The ability to test and measure results and evaluate the impact of promotions is a major strength of the direct marketing discipline. There are many different testing designs to harness this strength. We discuss three examples of how these designs are used to measure the impact of promotions. A detailed discussion of the strengths and weaknesses of these designs and others can be found elsewhere.^{2}
In the Pre/Post design, we:
- measure sales for a group of customers over a pre-period;
- introduce the new marketing program; and
- measure sales for the same group over a post-period.
This is a very "natural" type of marketing experimentation: it is easy to set up, inexpensive, and requires tracking only one group of customers. Any change from the pre- to the post-period is assumed to be due to the new marketing program.
The weakness of this design, of course, is that outside stimuli can affect the answer; it assumes that similar conditions hold in the pre- and post-periods. The longer the pre- and post-periods, the more likely other conditions will affect the test.
Seasonality is an especially potent threat to the validity of this design: changes from one time period to another may be due to normal changes in customer buying behavior rather than to the impact of a promotion. Retailers typically attempt to remedy this by using last year’s sales as the “pre-period,” but other differences, such as the economic climate between one year and the next, make causal inferences problematic.
Most direct marketers are familiar with test/control designs for evaluating different copies, packages, or offers. A group of customers is randomly assigned to test and control groups. The test group receives the new marketing stimulus, while the control group does not.
Since test and control groups don't differ in any systematic way, differences between the two groups can be attributed to the marketing stimulus, and not to any outside effects. This gives a more accurate "read" than the pre/post design, but at additional cost: two groups and random assignment procedures are required. Nevertheless, this design is the “gold standard” against which all other designs are measured.
The Pre/Post and Test/Control design combines the advantages of the two previous designs for evaluating the impact of promotions over a given time period. In addition to making the groups of customers comparable across both observed and unobserved covariates, it explicitly adjusts for changes between the pre- and post-periods that we would expect to see in the absence of the promotion.
Its relative disadvantages: it takes more time to set up, requires two different groups and random assignment procedures, and relies on the concept of “expected sales” – integral to the design, as explained in detail below – which can be harder to present and explain.
Done properly, this design provides unambiguous proof of program effectiveness and estimates the incremental sales due to the marketing program (i.e., sales that would not have been gained without the marketing program). As such, it is ideal for testing programs where return on investment (ROI) decisions need to be made regarding the allocation of marketing dollars.
There are several measures of incremental sales using this design, one of which is described below:^{3}
Incremental Sales = Observed Sales - Expected Sales
where
'Observed' sales are the test group's actual sales in the post-period
'Expected' sales are an estimate of what the test group's sales would have been in the post-period without the promotion
What is the best estimate of the expected post-period test group sales, i.e., the sales that would have occurred without the promotion?
One estimate is the pre-period test group sales, adjusted by what happened to the control group from the pre- to the post-period:
Expected Post Test Group Sales = Pre-Period Test Sales * (Post Control Sales / Pre Control Sales)
Note how the pre-period Test Sales are adjusted: when the Post Control group has higher sales than the Pre Control group, the Pre Test group's sales are multiplied by a number greater than 1; when the Post Control sales are lower than the Pre Control sales, the Pre Test group's sales are multiplied by a number less than 1, adjusting the "expected" sales downward.
In the following example, a group of heavy users is randomly assigned to test and control groups. The test group is invited to join a frequency program. The pre- and post-periods are 3 months in duration, and the dollar amounts represent average monthly sales per customer:
Table 1: Example of Pre/Post and Test/Control Incremental Sales
Both groups decline in the post-period, but the control group, without the program, declines much more. The "Change" quantity is the raw change (Post Sales minus Pre Sales); the "Post-Period Expected" is what we would have expected the test group to do had there been no promotion, calculated as follows:
Pre-Period Test Sales x (Post Control Sales / Pre Control Sales)
= $32.00 x ($24.32 / $31.68)
= $32.00 x 0.7677
= $24.57
The difference between the Post Test and Expected Post Test is the Incremental Change due to the promotion: $25.60 - $24.57 = $1.03.
The incremental percent change is the incremental change divided by the Post Expected: ($1.03 / $24.57) x 100 ≈ 4%.
This represents the average increase in sales per customer we would gain by rolling out the frequency program to all heavy users. Multiplying $1.03 by the number of heavy users and then by 12 months would give the annual incremental sales due to the promotion for all heavy users (assuming the test group is representative of all heavy users).
Note how this design is able to quantify the program's positive effect despite a decline from the pre-period to the post-period. In the absence of the program, the decline would have been even greater!
This measure of incremental sales has intuitive appeal and uses the available preperiod information; however, there is no simple statistical significance test or measurement of the error in the estimate as there is for the simple test/control group design. In this case, resampling may be required to construct estimates of the error.
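The expected-sales arithmetic above, and the resampling suggestion for estimating error, can be sketched in Python. This is an illustrative sketch, not code from the article; the customer-level (pre, post) sales pairs passed to the bootstrap helper are hypothetical:

```python
import random

def incremental_sales(pre_test, post_test, pre_control, post_control):
    """Incremental sales per the Pre/Post and Test/Control design:
    observed post-period test sales minus 'expected' sales, where
    expected = pre-period test sales scaled by the control group's
    post/pre ratio."""
    expected = pre_test * (post_control / pre_control)
    return post_test - expected, expected

# Figures from the frequency-program example in the text
# (average monthly sales per customer).
inc, expected = incremental_sales(32.00, 25.60, 31.68, 24.32)

def bootstrap_ci(test_pairs, control_pairs, n_boot=2000, seed=7):
    """Percentile confidence interval for incremental sales, obtained
    by resampling customers with replacement within each group.
    Each element of test_pairs / control_pairs is a hypothetical
    (pre_sales, post_sales) tuple for one customer."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        t = [rng.choice(test_pairs) for _ in test_pairs]
        c = [rng.choice(control_pairs) for _ in control_pairs]
        pre_t = sum(p for p, _ in t) / len(t)
        post_t = sum(q for _, q in t) / len(t)
        pre_c = sum(p for p, _ in c) / len(c)
        post_c = sum(q for _, q in c) / len(c)
        estimates.append(post_t - pre_t * (post_c / pre_c))
    estimates.sort()
    return estimates[int(0.025 * n_boot)], estimates[int(0.975 * n_boot)]
```

With the numbers from Table 1, `expected` reproduces the $24.57 figure and `inc` the $1.03 incremental change.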
Below is a summary of the designs and their advantages and disadvantages:
| Design | Advantages | Disadvantages |
| --- | --- | --- |
| Pre/Post | Inexpensive (one group); easy to set up; easy to explain | Takes time; outside stimuli may affect outcome |
| Test/Control | Easy to read; quick; outside stimuli don't affect outcome; statistical inference easy | More expensive (two groups required); random assignment (or more complex matching procedures) required |
| Pre/Post and Test/Control | Most accurate and comprehensive measurement of incremental sales | Expensive; harder to explain; statistical inference harder (resampling may be required) |
We have argued that the Pre/Post and Test/Control design provides a useful measure of incremental sales. However, it requires a control group, and often random assignment is not possible for ethical or practical business reasons (e.g., we may not want to lose any revenue opportunity by excluding a group of customers from a promotion). How do we obtain a control group without random assignment?
Fortunately, this problem has been extensively researched and many solutions are available.^{4} One technique we have found useful is the matching algorithm described by Rosenbaum^{5} and implemented as a SAS macro.^{6} This particular matching algorithm is most useful when there are at least twice as many untreated customers as treated customers, which makes it more likely that a set of untreated customers can be found that matches the treated customers as closely as possible on a set of variables related to the outcome. The motivation for matching the two groups is that if they have similar pre-period purchase behavior and other characteristics (such as demographics and attitudes), then any differences we see during or after the promotion can be more plausibly attributed to the promotion and not to preexisting differences between the groups on the covariates. (One exception that remains a threat to this inference is self-selection bias, which arises when customers “self-select” into receiving the treatment; this bias is explored in more detail below.)
To match the groups across a set of covariates, Rosenbaum conceived of the matching problem as a network flow optimization problem, amenable to linear programming solutions, where specialized algorithms exist to find the flow through a network with minimum cost. As implemented by Bergstralh and Kosanke, the “greedy” matching algorithm finds matched pairs of test and control customers that differ by at most an amount specified by the user for each individual covariate.^{6} The algorithm is called “greedy” in that it will match a treated customer with the first potential control that satisfies the maximum allowed difference on each covariate, even if another control exists that is more similar to the treated customer (i.e., that has a smaller difference over the matching covariates).
The “optimal” matching algorithm considers all potential controls for a given treated customer and selects the pair that not only satisfies the maximum difference specified by the user but also minimizes the differences over all matching variables. As might be expected, this algorithm takes more time and may become prohibitively time-consuming when there are many covariates. In practice, when we have three times as many potential controls as treated customers, we have found the “greedy” algorithm to yield acceptable matches.
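A minimal sketch of the “greedy” caliper idea (an illustration, not the Bergstralh–Kosanke SAS macro itself) might look like the following in Python; the covariate names and caliper values are made up for illustration:

```python
def greedy_match(treated, controls, calipers):
    """Greedy 1:1 matching: each treated record takes the FIRST
    untreated record in the pool that lies within the caliper on
    every covariate, even if a closer candidate exists later on.
    Matched controls are removed from the pool.

    treated, controls: lists of dicts of covariate values.
    calipers: dict of maximum allowed |difference| per covariate.
    Returns a list of (treated, control) pairs.
    """
    pool = list(controls)
    pairs = []
    for t in treated:
        for i, c in enumerate(pool):
            if all(abs(t[v] - c[v]) <= cal for v, cal in calipers.items()):
                pairs.append((t, c))
                del pool[i]          # each control is used at most once
                break
    return pairs

# Hypothetical data: two treated customers, three potential controls.
treated = [{"sales": 100.0, "trips": 10}, {"sales": 500.0, "trips": 30}]
controls = [{"sales": 130.0, "trips": 12},
            {"sales": 90.0, "trips": 11},
            {"sales": 480.0, "trips": 29}]
pairs = greedy_match(treated, controls, {"sales": 50.0, "trips": 8})
```

Note the greediness: the first treated customer is paired with the $130 control because it is encountered first, even though the $90 control is a closer match on sales. Running the same routine with the propensity score as the sole covariate reproduces the single-variable matching discussed later in the article.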
A potential problem in finding acceptable matches with either of these algorithms is the amount of time for the program to run when the number of variables or the number of records becomes large (e.g., more than six covariates and more than 500,000 records on a single-processor Windows XP machine). Also, the probability of finding good matches diminishes as the number of variables increases. In this case, we have found that using the propensity score as the sole matching variable with either of these algorithms finds control groups that are acceptable matches.
The propensity score is defined as the probability of assignment to the treated group, usually estimated with a logistic regression model with treatment (0/1) as the dependent variable. Customers with the same or similar propensity scores produce groups that will be matched, or at least balanced, with respect to the covariates in the model. (‘Balanced’ in this case means that even if there exist pairs that are not matched on any one variable, the two groups will have similar means and variances on the covariates.) It is important to note that the propensity score is estimated without reference to the outcome in order to find matched groups that would have occurred had customers been randomly assigned to the groups.
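As a rough illustration of how a propensity score is produced, here is a minimal logistic regression fit by gradient ascent in pure Python. In practice one would use a statistics package for this; the routine and its toy data are a sketch, and it assumes covariates scaled to a similar range:

```python
import math

def estimate_propensity(X, y, lr=0.1, epochs=500):
    """Fit a logistic regression of treatment (1/0) on covariates by
    batch gradient ascent on the log-likelihood, and return a scoring
    function that maps a covariate row to an estimated propensity
    (probability of treatment).  Illustrative only."""
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)                      # intercept + coefficients
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            p = 1.0 / (1.0 + math.exp(-z))   # predicted P(treated)
            err = yi - p
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj + lr * g / n for wj, g in zip(w, grad)]

    def score(xi):
        z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
        return 1.0 / (1.0 + math.exp(-z))
    return score

# Toy example: treatment more likely for customers with higher
# (scaled) spend, so the fitted score should rise with spend.
X = [[0.0], [0.2], [0.4], [0.6], [0.8], [1.0]]
y = [0, 0, 0, 1, 1, 1]
score = estimate_propensity(X, y)
```

Matching (or balancing) on `score(xi)` alone then stands in for matching on all of the covariates at once.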
In our example below, we contrast matching with four variables to matching using only one variable – the propensity score.
A specialty fashion retailer invites selected, high-spending customers who use the store’s branded credit card to an exclusive, in-store shopping event. The event takes place one night after the store closes, and only invited customers may shop amidst a festive, catered “shopping party” with no crowds, succulent hors d'oeuvres and one-on-one attentive service.
A vital issue for management is whether the event merely shifts sales earlier at the expense of longer-term sales that would have occurred in the absence of the promotion. The question can be phrased as, “Does the event generate incremental sales, over and above what would have been expected, after the event occurs?” This question motivates us to evaluate the impact of the promotion using the Pre/Post and Test/Control design, where the pre-period is the month before the event and the post-period is the month after the event.
In this example, we analyzed a subset of customers invited to the store event in one region of the country (all data in this example have been disguised). Nearly 44,000 customers were invited to the event; of these, 18% (7,932 of 43,731) actually shopped at the event. The other 82% of customers, who did not shop the event, are the “pool” from which we will select our matched control group. Note that there is a potential self-selection bias, in that people who shop the event may do so for reasons that are not related to the purchase behavior variables we used in matching, which may therefore be an uncontrolled source of variation between the two groups.
We matched on four variables, each measured over the 12 months prior to the mailing of the invitation: total sales, sales on the store card, number of shopping trips, and multichannel shopping.
When we first proposed analyzing these events using this matching methodology, management asked, “Why not just compare the shoppers and nonshoppers?”
The table below shows how different these two groups of shoppers are in terms of the matching variables:
| Matching Variable | Event Buyer: No | Event Buyer: Yes |
| --- | --- | --- |
| Sales | $5,206 | $7,745 |
| Sales on Store Card | $4,470 | $6,610 |
| Shopping Trips | 23 | 38 |
| Multichannel shopping | 34% | 40% |
| # of Customers | 35,799 | 7,932 |

Table 2: Means of Matching Variables by Event Shopping
Since event shoppers are higher-spending customers, we would expect them to spend more in the post-period even in the absence of the promotion.
When we match the 7,932 event shoppers with a set of non-event shoppers, we get a group of control customers that are much more similar in terms of their average shopping behavior:
| Matching Variable | Statistic | Event Buyer: No | Event Buyer: Yes |
| --- | --- | --- | --- |
| Sales | Mean | $4,700 | $4,701 |
| Sales on Store Card | Mean | $4,385 | $4,385 |
| Multichannel shopping | Mean | 32% | 32% |
| Shopping Trips | Mean | 25.2 | 26.0 |
| Sales | Std | $2,390 | $2,390 |
| Sales on Store Card | Std | $2,357 | $2,357 |
| Multichannel shopping | Std | 47% | 47% |
| Shopping Trips | Std | 14.3 | 14.4 |
| # of Customers | Sum | 4,702 | 4,702 |
Table 3: Means and Standard Deviations of Matching Variables by Event Shopping
Note that we were only able to match 59% of the original 7,932 event shoppers using the criteria of a maximum difference in sales of $50, a maximum difference of 8 shopping trips and specifying an exact match for the multichannel shopping variable. By varying the strictness of the matching criteria, we can find matches for all event shoppers, but the differences between the two groups would then be considered too great by management to constitute a meaningful match.
Could we get a comparable match on more than 59% of event shoppers? We tried using the propensity score by developing a logistic regression model with the dependent variable of Event Shopping (1 = yes, 0 = no). The independent variables were the same ones used in the four-variable matching above. We developed models using both transformed and untransformed independent variables (matching results were similar for the two models, so we chose the model with the untransformed variables, as the variable means between the two groups were slightly closer). Using the estimated probability of event shopping as our propensity score, we varied the maximum allowable difference in propensity score between matched pairs, settling on a maximum difference of 0.05 (about half the size of the standard deviation of 0.11 of the estimated probability of event shopping). This criterion resulted in obtaining matches for 99% of the 7,932 potential matches, though at the cost of larger differences between the two groups:
Table 4: Matching Using the Propensity Score
We chose the matching obtained from the four-variable match, as we judged it more important to get closely comparable groups than to use more of the potential matches at the cost of comparability. (There is obviously a trade-off here; taken to the extreme, we could get perfect matches if we used only 10 customers.)
Using the set of 4,702 matched customers in each group, we observe their mean pre- and post-period purchasing and calculate the estimated incremental sales due to the promotion as shown below. For this calculation, we have “capped” the values of the sales variables so that the influence of outliers on the average is “dampened.” Instead of deleting outliers, we recoded any value above the 99th percentile to the 99th percentile value and any value below the 1st percentile to the 1st percentile value. In this way, we keep all the data and much of the extreme purchase behavior we see in this data set, while adding some measure of “robustness” to the estimate of incremental sales.
Table 5: Incremental Sales Calculation
Note that a key measure is the post-period sales: a major concern is that the event “steals” sales from the time period after the event. It is entirely possible that event shoppers could have simply done all of their month’s shopping at the event and stopped shopping in the subsequent month. There is some evidence of this “shifting”: the Event Buyers’ sales decline from $315 in the pre-period to $236 in the post-period, whereas the non-buyers’ sales increase slightly from $219 to $226. However, the small post-period decline among event shoppers is more than compensated for by their event purchases, making for a healthy net gain over the event and post-period.
Also of note in the above table is the pre-period difference of nearly $100, despite the matched purchase behavior over the 12-month matching period. We find this pattern in all of our analyses of this kind: when matching over a 12-month or longer time period, almost any one-month period after the matching period shows substantial differences between the two groups. An interesting question for our future work is to what extent including pre-period purchase behavior as one of the matching variables will make the groups more comparable while still using most of the potential matches available.
Note also that the observed-minus-expected estimate of incremental sales is lower than the one typically used by this company and others (the observed post-period difference between the two groups, $377 in this example), as well as lower than the $311 estimate using the groups selected by matching on the propensity score.
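The percentile capping (winsorizing) used in the incremental sales calculation can be sketched as follows. This is an illustration on hypothetical values; the nearest-rank percentile rule used here is one of several common conventions:

```python
def percentile(sorted_vals, pct):
    """Nearest-rank percentile on an already-sorted list."""
    idx = max(0, min(len(sorted_vals) - 1,
                     round(pct / 100.0 * (len(sorted_vals) - 1))))
    return sorted_vals[idx]

def cap_outliers(values, lower_pct=1, upper_pct=99):
    """Winsorize: recode values below the lower percentile up to that
    percentile, and values above the upper percentile down to it,
    keeping every record in the data rather than deleting outliers."""
    s = sorted(values)
    lo, hi = percentile(s, lower_pct), percentile(s, upper_pct)
    return [min(max(v, lo), hi) for v in values]

# Hypothetical sales values 0..100: the extremes get pulled in to the
# 1st and 99th percentile values, and no records are dropped.
values = list(range(101))
capped = cap_outliers(values, 1, 99)
```

The averages fed into the incremental sales calculation are then taken over `capped` rather than the raw values.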
A major advantage of the “matched control group methodology” is the simplicity of the method and presentation of results. There is no complicated model, no “black box,” but observed average sales per customer over different time periods that are easily understood by management.
Another advantage is the simplicity of the analysis and the freedom from the assumptions and constraints imposed by more commonly used parametric modeling techniques (e.g., ANCOVA or ordinary least squares regression). A typical modeling approach would require two-stage modeling: a response model predicting shopping in the post-period, and then a regression-based model predicting sales among predicted buyers. Modeling retail sales data such as the data in this example requires much data exploration, variable transformation, outlier identification and detection of the violations of assumptions that are routinely present with such skewed data.
Matching also more easily allows examining overlap on the covariates between the two groups. For example, if members of one group have preperiod sales outside the range of the other group, this fact will be immediately obvious when examining the covariates by group. In contrast, this information would not be as obvious from the standard output of a regression analysis that is modeling postperiod sales over the entire “universe” of customers.
The major limitation of the matching approach is that it only adjusts for observed covariates, unlike random assignment, which effectively balances groups on both observed and unobserved variables. However, one could perform sensitivity analyses to show how large an effect an unobserved variable would have to have to account for differences between the groups not already adjusted for by the matching variables (an example of this in the context of binary covariates is in Rosenbaum^{7}).
Self-selection bias is a more serious concern. Peikes et al.^{8} have shown that the type of matching analysis presented here can lead to conclusions quite different from an analysis employing random assignment, in a study of the impact of an employment program designed to help disabled beneficiaries. In our example, we have exactly this kind of self-selection: some customers decide to go to the event, while other customers with similar purchase behavior do not. There may be attitudinal differences that predispose some customers to shop at events and keep shopping afterwards, regardless of the particular promotion the company invests in. Conceivably, the company may be attributing the increased sales to an exclusive catered shopping event when there is really a set of customers predisposed to respond to ANY marketing that encourages additional shopping. Instead of spending a relatively large sum of money on an expensive, one-night “black tie” event, there may be an opportunity to invite customers to shop on any of a number of particular days without the expense of a “special event.” Testing of different invitations and event formats is now underway to determine whether similar results can be obtained at reduced cost.
Propensity matching and simple matching are workable alternatives for evaluating marketing events when random assignment is not possible. What are the advantages of other methods in comparison?
The regression discontinuity design, a specialized form of regression modeling, has been used to evaluate interventions where a cutoff score determines treatment assignment. For example, students with math skill scores below a cutoff “pre-score” are assigned to an intervention program, while those above the cutoff are not. The regression model predicting post-test scores has the pre-test scores and a dummy variable representing the treatment effect as independent variables. A significant dummy variable term would indicate an impact of the treatment over and above the pre-test score, producing a visual “discontinuity” at the cutoff between the best-fitting lines predicting post-test scores for the two groups.
We have an analogous situation in our example, in that there is a “cutoff” score for being invited to the event: pre-period sales on the company’s credit card. However, as discussed earlier, modeling this data is complex: the dependent variable is skewed (80% have no sales), the independent variables are skewed, and non-constant variance is present at different levels of both the independent and dependent variables. These challenges can be overcome, but at the cost of more time and complexity, and at the risk of making the analysis more of a “black box” that is difficult for management to understand.
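For concreteness, the regression discontinuity model just described can be sketched in Python by solving the ordinary-least-squares normal equations directly. All numbers below are hypothetical and noiseless, so the dummy coefficient recovers the built-in treatment effect exactly; real data would of course carry noise and the modeling complications noted above:

```python
def fit_rd(pre, post, cutoff):
    """Regression discontinuity sketch: fit
        post = b0 + b1*pre + b2*D,   D = 1 if pre < cutoff else 0,
    by solving the OLS normal equations (X'X) b = X'y for the three
    coefficients.  b2 estimates the treatment effect at the cutoff."""
    rows = [[1.0, p, 1.0 if p < cutoff else 0.0] for p in pre]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    b = [sum(r[i] * yi for r, yi in zip(rows, post)) for i in range(3)]
    # Gaussian elimination with partial pivoting on the 3x3 system.
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, 3):
            f = A[r][col] / A[col][col]
            for c in range(col, 3):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coef = [0.0] * 3
    for r in (2, 1, 0):   # back-substitution
        coef[r] = (b[r] - sum(A[r][c] * coef[c] for c in range(r + 1, 3))) / A[r][r]
    return coef

# Synthetic scores: treatment (D=1) for pre-scores below 4.5,
# true model post = 2 + 1.5*pre + 5*D.
pre = [1, 2, 3, 4, 5, 6, 7, 8]
post = [2 + 1.5 * p + (5 if p < 4.5 else 0) for p in pre]
coef = fit_rd(pre, post, 4.5)
```

Here `coef[2]` is the estimated treatment effect; a significant value of this coefficient is what signals the “discontinuity.”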
Intervention analysis using time series modeling has been successfully used to assess the impact of promotions for frequently purchased packaged goods. Each customer serves as his or her own control: expected sales for the individual are forecast from a time series model of past purchase behavior and compared with actual sales. Interventions, such as the marketing event described in our example, may be explicitly modeled with an intervention variable. Results are then aggregated over all customers to estimate the impact of the event. This approach avoids the complexity of regression modeling and matching, though data are likely to be sparse for some purchase categories. In our case, many customers have purchased 12 or more times per year, so this type of analysis may be a promising one to pursue.
References: