

Evaluating the Impact of Promotions without Randomly Assigned Control Groups

Originally published June 11, 2009

Although randomly assigned test and control groups are considered the “gold standard” for measuring campaign performance, there are many designs without random assignment that are useful to marketers. In this article, we describe one such design, the Pre-Post/Test-Control group design, and contrast variable matching and propensity matching techniques to identify customers that can serve as control groups. We also describe a measure of “incremental sales” as the difference between “expected” and observed sales in the post-period, where “expected” sales are estimated by weighting the pre-period test group sales by the ratio of control group post-period sales to control group pre-period sales. We illustrate this design with the matched control group analysis in a case study where a retailer is interested in evaluating the impact of an in-store promotion.

Here, we will describe a method to evaluate the impact of promotions without randomly assigning customers to control groups. In Section 1, we discuss several alternative designs and measures of incremental sales – sales that are estimated to be due solely to the impact of a promotion. Section 2 describes the use of the variable matching and the propensity score to find matched control groups.1 Section 3 illustrates estimating incremental sales using both methods to find control groups that can be used to evaluate the impact of an actual in-store promotion. Section 4 discusses the limitations of the approach and comparisons with other methods, ending with future directions for research.

Section 1. Designs to Measure Incremental Sales

The ability to test and measure results and evaluate the impact of promotions is a major strength of the direct marketing discipline. There are many different testing designs to harness this strength. We discuss three examples of how these designs are used to measure the impact of promotions. A detailed discussion of the strengths and weaknesses of these designs and others can be found elsewhere.2

Pre/Post Design

In this design, we:

  1. Observe sales (or response, profitability, etc.) in the pre-period.

  2. Apply the marketing stimulus (e.g., an invitation to an in-store event).

  3. Observe sales in the post-period.

This is a very "natural" type of marketing experimentation; it is easy to set up, inexpensive and requires tracking only one group of customers. Any change from the pre to post-period is assumed to be due to the new marketing program.

The weakness of this design, of course, is that outside stimuli can affect the answer; it assumes similar conditions hold in the pre- and post-period. The longer the pre- and post-periods, the more likely other conditions will affect the test.

Seasonality is an especially potent threat to the validity of this design: changes from one time period to another may be due to normal changes in customer buying behavior, rather than to the impact of a promotion. Retailers typically attempt to remedy this by using last year’s sales as the “pre-period,” but other differences, such as the economic climate between one year and the next, make causal inferences problematic.

Test/Control Design

Most direct marketers are familiar with test/control designs for evaluating different copies, packages, or offers. A group of customers are randomly assigned into test and control groups. The test group receives the new marketing stimulus, while the control group does not.

Since test and control groups don't differ in any systematic way, differences between the two groups can be attributed to the marketing stimulus, and not to any outside effects. This gives a more accurate "read" than the pre/post design, but at additional cost: two groups and random assignment procedures are required. Nevertheless, this design is the “gold standard” against which all other designs are measured.

Pre/Post and Test/Control Design

This design combines the advantages of the two previous designs for evaluating the impact of promotions over a given time period. In addition to making groups of customers comparable across both observed and unobserved covariates, it explicitly adjusts for changes between the pre and post periods that we would expect to see in the absence of the promotion.

Its relative disadvantages: it takes more time to set up, it requires two different groups and random assignment procedures, and the concept of "expected sales" – integral to the design, as explained in detail below – can be harder to present and explain.

Done properly, this design provides unambiguous proof of program effectiveness and estimates the incremental sales due to the marketing program (i.e., sales that would not have been gained without the marketing program). As such, it is ideal for testing programs where return on investment (ROI) decisions need to be made regarding the allocation of marketing dollars.

There are several measures of incremental sales using this design, one of which is described below:3
    
Incremental Sales = Observed Sales - Expected Sales

where

'Observed' sales are the test group's actual sales in the post-period
'Expected' sales are the sales the test group would have generated in the post-period in the absence of the promotion

What is the best estimate of the expected post test group sales, i.e., the sales that would have occurred without the promotion?

One estimate is the pre-period sales, but adjusted by what happened to the control group from the pre to post-period:

Expected Post Test Group Sales = Pre-Period Test Sales * (Post Control Sales / Pre Control Sales)

Note how the pre-period Test Sales are adjusted: when the Post Control group has higher sales than the Pre Control, the Pre Test group's sales will be multiplied by a number greater than 1; when the post-period sales are lower than the pre-period control group, the Pre-period Test group sales will be multiplied by a number less than 1, adjusting the "expected" sales downward.

In the following example, a group of heavy users are randomly assigned to test and control groups. The test group is invited to join a frequency program. The pre and post-periods are 3 months in duration, and the dollar amount represents average monthly sales per customer:

Group      Pre-Period   Post-Period   Change    Post-Period Expected
Test        $32.00       $25.60       -$6.40         $24.57
Control     $31.68       $24.32       -$7.36           —

Table 1: Example of Pre/Post and Test/Control Incremental Sales
  

Both groups decline in the post-period, but the control group, without the program, declines much more. The "Change" quantity is the raw change, Post Sales minus Pre Sales; the "Post-Period Expected" is what we would have expected the test group to do had there been no promotion, calculated as follows:

Pre-Period Test Sales x (Post Control Sales / Pre Control Sales)
= $32.00  x ($24.32 / $31.68)
= $32.00  x 0.77
= $24.57

The difference between the Post Test and Expected Post Test is the Incremental Change due to the promotion: $25.60 - $24.57 = $1.03.

The incremental percent change is the incremental change divided by Post Expected: ($1.03 / $24.57) x 100 ≈ 4.2%.

This represents the average increase in sales per customer we would gain by rolling out the frequency program to all heavy users. Multiplying $1.03 by the number of heavy users and then by 12 months would give the annual incremental sales due to the promotion for all heavy users (assuming the test group is representative of all heavy users).
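The expected-sales and incremental-sales arithmetic above can be sketched in a few lines (the helper function name is ours, not part of the original analysis):

```python
def expected_post_test(pre_test, pre_control, post_control):
    """Expected post-period test-group sales had there been no promotion."""
    return pre_test * (post_control / pre_control)

# Figures from Table 1 (average monthly sales per customer)
expected = expected_post_test(32.00, 31.68, 24.32)   # ≈ $24.57
incremental = 25.60 - expected                       # ≈ $1.03
pct_change = 100 * incremental / expected            # ≈ 4.2%
```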

Note how this design is able to quantify the program's positive effect, in spite of a decline in the post-period compared to the pre-period. In the absence of the program the decline would have been even greater!

This measure of incremental sales has intuitive appeal and uses the available pre-period information; however, there is no simple statistical significance test or measurement of the error in the estimate as there is for the simple test/control group design. In this case, re-sampling may be required to construct estimates of the error.
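As a minimal sketch of what such re-sampling could look like, the following bootstraps a 95% interval for the observed-minus-expected estimate, assuming per-customer sales lists for each group (the function name and interface are illustrative, not a prescribed procedure):

```python
import random

def bootstrap_incremental(test_pre, test_post, ctrl_pre, ctrl_post,
                          n_boot=1000, seed=0):
    """Bootstrap a 95% interval for incremental sales per customer.
    Inputs are per-customer sales lists for the test and control groups."""
    rng = random.Random(seed)
    n_t, n_c = len(test_pre), len(ctrl_pre)
    estimates = []
    for _ in range(n_boot):
        ti = [rng.randrange(n_t) for _ in range(n_t)]   # resample test customers
        ci = [rng.randrange(n_c) for _ in range(n_c)]   # resample control customers
        tp = sum(test_pre[i] for i in ti) / n_t          # mean pre-period, test
        tq = sum(test_post[i] for i in ti) / n_t         # mean post-period, test
        cp = sum(ctrl_pre[i] for i in ci) / n_c          # mean pre-period, control
        cq = sum(ctrl_post[i] for i in ci) / n_c         # mean post-period, control
        estimates.append(tq - tp * (cq / cp))            # observed - expected
    estimates.sort()
    return estimates[int(0.025 * n_boot)], estimates[int(0.975 * n_boot)]
```

Each bootstrap replicate resamples customers (with replacement) within each group, recomputes observed-minus-expected sales, and the 2.5th and 97.5th percentiles of the replicates bound the estimate.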

Advantages and Disadvantages of Testing Designs

Below is a summary of the designs and their advantages and disadvantages:

Pre/Post
  Advantages: Inexpensive (one group); easy to set up; easy to explain
  Disadvantages: Takes time; outside stimuli may affect outcome

Test/Control
  Advantages: Easy to read; outside stimuli don't affect outcome; quick; statistical inference easy
  Disadvantages: More expensive (two groups required); random assignment (or more complex matching procedures) required

Pre/Post and Test/Control
  Advantages: Most accurate and comprehensive measurement of incremental sales
  Disadvantages: Expensive; harder to explain; statistical inference harder (resampling may be required)

Section 2. Matched Control Groups

We have argued that the Pre/Post and Test/Control design provides a useful measure of incremental sales. However, it requires a control group and often there cannot be random assignment for ethical or practical business reasons (e.g., we may not want to lose any revenue opportunity by excluding a group of customers from a promotion). How do we obtain a control group without random assignment?

Fortunately, this problem has been extensively researched and many solutions are available.4 One technique we have found useful is the matching algorithm described by Rosenbaum5 and implemented as a SAS macro.6 This particular matching algorithm is most useful when there are at least twice as many untreated customers as there are treated customers. This makes it more likely that a set of untreated customers can be found that match as closely as possible the treated customers on a set of variables related to the outcome. The motivation for matching the two groups is that if they have similar pre-period purchase behavior and other characteristics (such as demographics and attitudes), then any differences we see during or after the promotion can be more plausibly attributed to the promotion and not to pre-existing differences between the groups on the covariates. (One exception that remains a threat to this plausible inference is self-selection biases when customers “self-select” into receiving the treatment; this bias is explored in more detail below).

To match the groups across a set of covariates, Rosenbaum conceived of the matching problem as a network flow optimization problem, amenable to linear programming solutions, where specialized algorithms exist to find the flow through a network with minimum cost. As implemented by Bergstralh and Kosanke, the "greedy" matching algorithm finds matched pairs of test and control customers that differ by no more than an amount specified by the user for each individual covariate.6 The algorithm is called "greedy" in that it accepts the first potential control that satisfies the maximum difference allowed between a test and control pair for each covariate, even if another control exists that is more similar to the treated customer (i.e., that has a smaller difference over the matching covariates).
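The greedy idea can be sketched as follows. This is an illustration of the logic, not the Bergstralh-Kosanke SAS macro itself; the data layout (dictionaries of covariates keyed by customer ID) and the caliper values are our own assumptions:

```python
def greedy_match(treated, controls, calipers):
    """Greedy 1:1 matching: each treated customer takes the FIRST unused
    control whose difference on every covariate is within its caliper
    (maximum allowed difference). A caliper of 0 forces an exact match."""
    used, pairs = set(), []
    for t_id, t_cov in treated.items():
        for c_id, c_cov in controls.items():
            if c_id in used:
                continue
            if all(abs(t_cov[v] - c_cov[v]) <= cap for v, cap in calipers.items()):
                pairs.append((t_id, c_id))
                used.add(c_id)
                break                      # greedy: stop at the first acceptable control
    return pairs

# Hypothetical example: match on sales within $10 and exactly on multi-channel flag
treated = {"t1": {"sales": 100, "multi": 1}, "t2": {"sales": 200, "multi": 0}}
controls = {"c1": {"sales": 105, "multi": 1},
            "c2": {"sales": 198, "multi": 0},
            "c3": {"sales": 500, "multi": 0}}
pairs = greedy_match(treated, controls, {"sales": 10, "multi": 0})
```

An "optimal" variant would instead scan all unused controls and keep the one with the smallest total difference, at greater computational cost.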

The “optimal” matching algorithm considers all potential controls to a given treated customer and selects the one that not only satisfies the maximum difference specified by the user, but the pair that minimizes the differences over all matching variables. As might be expected, this algorithm takes more time and may become prohibitively time-consuming when there are many covariates. In practice, when we have three times as many potential controls as treated customers, we have found the ‘greedy’ algorithm to yield acceptable matches.

A potential problem in finding acceptable matches with either of these algorithms is the amount of time for the program to run when the number of variables or the number of records becomes large (e.g., more than six covariates and more than 500,000 records on a single-processor Windows XP machine). Also, the probability of finding good matches diminishes as the number of variables increases. In this case, we have found using the propensity score as the sole matching variable with either of these algorithms finds control groups that are acceptable matches.

The propensity score is defined as the probability of assignment to the treated group, usually estimated with a logistic regression model with treatment (0/1) as the dependent variable. Customers with the same or similar propensity scores produce groups that will be matched, or at least balanced, with respect to the covariates in the model. (‘Balanced’ in this case means that even if there exist pairs that are not matched on any one variable, the two groups will have similar means and variances on the covariates.) It is important to note that the propensity score is estimated without reference to the outcome in order to find matched groups that would have occurred had customers been randomly assigned to the groups.
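A toy version of the propensity model can be written in a few lines. This is purely illustrative (a real analysis would use PROC LOGISTIC or an equivalent statistics package); the gradient-ascent fit, learning rate, and iteration count are our own assumptions:

```python
import math

def fit_propensity(X, treated, lr=0.1, iters=2000):
    """Toy logistic regression by gradient ascent on the log-likelihood.
    X: list of covariate vectors (ideally standardized); treated: 0/1 labels.
    Note the outcome variable never enters the propensity model."""
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)                      # [intercept, coef_1, ..., coef_p]
    for _ in range(iters):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, treated):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            pi = 1.0 / (1.0 + math.exp(-max(min(z, 30.0), -30.0)))
            err = yi - pi                    # score contribution of this customer
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj + lr * g / n for wj, g in zip(w, grad)]
    return w

def propensity(w, x):
    """Estimated probability of treatment (the propensity score) for x."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-max(min(z, 30.0), -30.0)))
```

The fitted scores can then be fed to a matching routine as the single matching variable.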

In our example below, we contrast matching with four variables to matching using only one variable – the propensity score.

Section 3. Matching Example: Evaluating the Impact of an In-Store Promotion

A specialty fashion retailer invites selected, high-spending customers who use the store's branded credit card to an exclusive, in-store shopping event. The event takes place one night after the store closes: only invited customers may shop, amidst a festive, catered "shopping party" with no crowds, succulent hors d'oeuvres and attentive one-on-one service.

A vital issue to management is whether the event merely shifts sales earlier at the expense of longer-term sales that would have occurred in the absence of the promotion. The question can be phrased as, “Does the event generate incremental sales, over and above what would have been expected, after the event occurs?” This question motivates us to evaluate the impact of the promotion using the pre/post and test/control design, where the pre-period is a month before the event and the post-period is the month after the event.

In this example, we analyzed a subset of customers invited to the store event in one region of the country (all data in this example have been disguised). Nearly 44,000 customers were invited to the event; of these, 18% (7,932 of 43,731) actually shopped at the event. The other 82% of customers who did not shop the event are the “pool” of customers from which we will select our matched control group. Note that there is a potential self-selection bias, in that people who shop the event may do so for reasons that are not related to the purchase behavior variables we used in matching and may therefore be an uncontrolled source of variation between the two groups.

We matched on four variables that were measured over the 12 months prior to mailing of the invitation:

  • Overall Sales

  • Sales on the store card

  • Number of shopping trips

  • Multi-channel shopper (shopped both online and at a store; a binary variable with values of 0 or 1)

When we first proposed analyzing these events using this matching methodology, management asked, “Why not just compare the shoppers and non-shoppers?”

The table below shows how different these two groups of shoppers are in terms of the matching variables:

Matching Variable         EventBuyer: No   EventBuyer: Yes
Sales                         $5,206           $7,745
Sales on Store Card           $4,470           $6,610
Shopping Trips                    23               38
Multi-channel shopping           34%              40%
# of Customers                35,799            7,932

Table 2: Means of Matching Variables by Event Shopping

Since event shoppers are higher spend customers, we would expect them to be spending more in the post-period, even in the absence of the promotion.

When we match the 7,932 event shoppers with a set of non-event shoppers, we get a group of control customers that are much more similar in terms of their average shopping behavior:

Matching Variable        Statistic   EventBuyer: No   EventBuyer: Yes
Sales                    Mean            $4,700           $4,701
Sales on Store Card      Mean            $4,385           $4,385
Multi-channel shopping   Mean               32%              32%
Shopping Trips           Mean              25.2             26.0
Sales                    Std             $2,390           $2,390
Sales on Store Card      Std             $2,357           $2,357
Multi-channel shopping   Std                47%              47%
Shopping Trips           Std               14.3             14.4
# of Customers           Sum              4,702            4,702

Table 3: Means and Standard Deviations of Matching Variables by Event Shopping

Note that we were only able to match 59% of the original 7,932 event shoppers using the criteria of a maximum difference in sales of $50, a maximum difference of 8 shopping trips and specifying an exact match for the multi-channel shopping variable. By varying the strictness of the matching criteria, we can find matches for all event shoppers, but the differences between the two groups would then be considered too great by management to constitute a meaningful match.

Matching Using the Propensity Score

Could we obtain a comparable match for more than 59% of event shoppers? We tried the propensity score, developing a logistic regression model with Event Shopping (1=yes, 0=no) as the dependent variable. The independent variables were the same four used in the variable matching above. We developed models using both transformed and untransformed independent variables (matching results were similar for the two models, so we chose the model with the untransformed variables, as the variable means between the two groups were slightly closer). Using the estimated probability of event shopping as our propensity score, we varied the maximum allowable difference in propensity score between matched pairs, settling on a maximum difference of 0.05 (about half the standard deviation of 0.11 of the estimated probability of event shopping). This criterion yielded matches for 99% of the 7,932 potential matches, though at the cost of larger differences between the two groups:


Table 4: Matching Using the Propensity Score

We chose the matching obtained from the 4-variable match, as we judge it more important to get comparable groups rather than less comparable groups that use more of the potential matches. (There is obviously a trade-off here, as we could get perfect matches if we only used 10 customers.)

Incremental Sales Calculation

Using the set of 4,702 matched customers in each group, we observe their mean pre and post-period purchasing and calculate the estimated incremental sales due to the promotion as shown below. For this calculation, we have “capped” the values of the sales variables so that the influence of outliers on the average will be “dampened.” Instead of deleting outliers, we recoded any value above the 99th percentile to the 99th percentile value and any value below the 1st percentile to the 1st percentile value. In this way, we keep all the data and much of the extreme purchase behavior we see in this data, while at the same time adding some measure of “robustness” to the estimate of incremental sales.
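The capping step can be sketched as a small winsorizing function. The simple index-based percentile used here is an approximation and our own choice; statistical packages offer more refined percentile definitions:

```python
def cap_percentiles(values, lo_pct=1, hi_pct=99):
    """Winsorize rather than delete: recode values below the 1st percentile
    up to it, and values above the 99th percentile down to it, keeping all
    customers in the data while dampening the influence of outliers."""
    s = sorted(values)
    n = len(s)
    lo = s[int(lo_pct / 100 * (n - 1))]    # approximate 1st-percentile value
    hi = s[int(hi_pct / 100 * (n - 1))]    # approximate 99th-percentile value
    return [min(max(v, lo), hi) for v in values]
```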


Table 5: Incremental Sales Calculation

Note that a key measure is post-period sales: a major concern is that the event "steals" sales from the time period after the event. It is entirely possible that event shoppers simply did all their month's shopping at the event and stopped shopping in the subsequent month. There is some evidence of this "shifting": Event Buyers' sales decline from $315 in the pre-period to $236 in the post-period, whereas non-buyers increase slightly from $219 to $226. However, the small post-period decline among event shoppers is more than compensated for by their event purchases, making for a healthy net gain over the event and post-period combined.

Also of note in the above table is the pre-period difference of nearly $100, despite the matched purchase behavior over the 12-month matching period. We find this pattern in all of our analyses of this kind: when matching over a 12-month or longer time period, almost any one-month period after the matching period shows substantial differences between the two groups. An interesting question for our future work is to what extent including pre-period purchase behavior as one of the matching variables will make the groups more comparable while still using most of the potential matches available.

Note also that the observed-expected estimate of incremental sales is lower than the measure typically used by this company and others (the observed post-period difference between the two groups, $377 in this example), and lower than the $311 estimate obtained using the groups selected by matching on the propensity score.

Section 4. Comparison with Other Methods, Limitations, and Future Research

A major advantage of the “matched control group methodology” is the simplicity of the method and presentation of results. There is no complicated model, no “black box,” but observed average sales per customer over different time periods that are easily understood by management.

Another advantage is the simplicity of analysis and the freedom from the assumptions and constraints imposed by more commonly used parametric modeling techniques (e.g., ANCOVA or ordinary least squares regression). A typical modeling approach would require two stages: a response model predicting shopping in the post-period, and then a regression-based model predicting sales among predicted buyers. Modeling retail sales data such as we are dealing with in this example requires much data exploration, variable transformation, outlier identification and detection of the violations of assumptions that are routinely present with such skewed data.

Matching also more easily allows examining overlap on the covariates between the two groups. For example, if members of one group have pre-period sales outside the range of the other group, this fact will be immediately obvious when examining the covariates by group. In contrast, this information would not be as obvious from the standard output of a regression analysis that is modeling post-period sales over the entire “universe” of customers.

Limitations

The major limitation of the matching approach is that it adjusts only for observed covariates, unlike random assignment, which effectively balances groups on both observed and unobserved variables. However, one can perform sensitivity analyses to show how large an unobserved variable's impact would have to be to account for the differences between groups not already adjusted for by the matching variables (an example in the context of binary covariates is given by Rosenbaum7).

Self-selection bias is a more serious concern. Peikes et al.8 have shown that the type of matching analysis presented here can lead to conclusions quite different from an analysis with random assignment, in a study of the impact of an employment program designed to help disabled beneficiaries. In our example, we have exactly this case of self-selection: some customers decide to go to the event, while other customers, with similar purchase behavior, do not. There may be attitudinal differences that predispose some customers to shop at events and keep shopping subsequently, regardless of the particular promotion the company invests in. Conceivably, the company may be attributing the increased sales to an exclusive catered shopping event when really there is a set of customers predisposed to respond to ANY marketing that encourages additional shopping. Instead of spending a relatively large sum of money on an expensive, one-night "black tie" event, there may be an opportunity to invite customers to shop on any of a number of particular days without the expense of setting up a "special event." Testing of different invitations and event formats is now underway to determine whether similar results can be obtained at reduced cost.

Future Research

Propensity matching and simple matching are workable alternatives for evaluating marketing events when random assignment is not possible. What are the advantages of other methods in comparison?

The regression discontinuity design, a specialized form of regression modeling, has been used to evaluate interventions where a cut-off score determines treatment assignment. For example, students with math skill scores below a cut-off "pre-score" are assigned to an intervention program, while those above the cut-off are not. The regression model predicting post-test scores has the pre-test scores and a dummy variable representing the treatment effect as independent variables. A significant dummy variable term would indicate an impact of the treatment over and above the pre-test score, shifting the fitted line for the treated group and producing a visual "discontinuity" between the best-fitting lines predicting post-test scores for the two groups at the cut-off.

We have an analogous situation in our example, in that there is a “cut-off” score for being invited to the event: pre-period sales on the company’s credit card. However, as discussed earlier, the modeling is complex with this data in that the dependent variable is skewed (80% have no sales), the independent variables are skewed and non-constant variance is present at different levels of both independent and dependent variables. These challenges can be overcome, but at the cost of more time, complexity and making the analysis more of a “black box” that is difficult for management to understand.
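Setting aside the skewness issues just discussed, the basic regression-discontinuity model can be sketched as an ordinary least squares fit with a treatment dummy. The data layout and function name here are illustrative assumptions, not the retailer's actual analysis:

```python
def fit_rdd(pre, post, cutoff):
    """Sketch of the regression-discontinuity model:
        post ~ b0 + b1*pre + b2*treated,  treated = 1 when pre < cutoff.
    Solves the 3x3 normal equations (X'X)b = X'y by Gaussian elimination."""
    rows = [[1.0, x, 1.0 if x < cutoff else 0.0] for x in pre]
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    Xty = [sum(r[i] * y for r, y in zip(rows, post)) for i in range(3)]
    for i in range(3):                       # forward elimination
        for j in range(i + 1, 3):
            f = XtX[j][i] / XtX[i][i]
            for k in range(3):
                XtX[j][k] -= f * XtX[i][k]
            Xty[j] -= f * Xty[i]
    b = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):                      # back substitution
        b[i] = (Xty[i] - sum(XtX[i][k] * b[k] for k in range(i + 1, 3))) / XtX[i][i]
    return b                                 # b[2] estimates the treatment effect
```

A significant b[2] is the "discontinuity": the vertical gap between the two fitted lines at the cut-off.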

Intervention analysis using time series modeling has been successfully used to assess the impact of promotions for frequently purchased packaged goods. Each customer serves as their own control: expected sales for the individual is forecast based on a time series model of past purchase behavior and compared with actual sales. Interventions, such as the marketing events described in our example, may be explicitly modeled with an intervention variable. Results are then aggregated over all customers to estimate the impact of the event. This approach avoids the complexity of regression modeling and matching, though data is likely to be sparse for some purchase categories. In our case, many customers have purchased 12 or more times per year, so this type of analysis may be a promising one to pursue. 
 

References:

  1. Rubin, D. Matched Sampling for Causal Effects, Cambridge University Press, (2006).
  2. Cook, T. and Campbell, R. Quasi-Experimentation: Design and Analysis Issues for Field Settings, 1st ed., Houghton Mifflin Company, (1979).
  3. Rooney (2008). I have been unable to find a citation in the literature for this measure of incremental sales. I first came across it in 1989 at Citicorp POS Information Services where it was used for evaluating the impact of coupon promotions in grocery stores.
  4. Rosenbaum, P. and Rubin, D. “The Central Role of the Propensity Score in Observational Studies for Causal Effects,” Biometrika, 70, 41-55, (1983).
  5. Rosenbaum, P. "Optimal Matching for Observational Studies," Journal of the American Statistical Association, Vol. 84, No. 408, 1024-1032, (1989).
  6. Bergstralh, E. and Kosanke, J. "SAS Macro Match", http://mayoresearch.mayo.edu/mayo/research/biostat/sasmacros.cfm, (1995).
  7. Rosenbaum, P. R., "Sensitivity Analysis for Certain Permutation Tests in Matched Observational Studies," Biometrika, 74, 13-26 (1987); Correction, 75, 396 (1988).
  8. Peikes, D.N., Moreno, L. and Orzol, S.M. “Propensity Score Matching: A Note of Caution for Evaluators of Social Programs,” The American Statistician, August 2008, Vol. 62, No. 3, (2008).
  • Patrick Rooney, Ph.D.
    Patrick is a proven professional with more than 20 years' experience in deriving insight from statistical data analysis and modeling. His expertise in quantitative analysis and statistical modeling has helped companies in the retail, financial services, and software sectors optimize their marketing and merchandising activities. Prior to his consulting work at The Modeling Agency, Patrick was at Nordstrom for 10 years, where he was responsible for supporting decision-making throughout the organization using mathematical and statistical techniques. He managed a team of analysts and statisticians that solved problems in real estate strategy and growth optimization modeling; human capital analytics; retail and online customer segmentation; direct marketing models; assessing the impact of marketing promotions; and identifying the return on investment (ROI) of all marketing media and activities via marketing mix models.

    Patrick is a member of the Direct Marketing Association and has been a featured speaker at several of the annual DMA conferences, as well as a member of the American Marketing Association and the American Statistical Association. He may be reached by email at patrick.rooney.numbersLLC@gmail.com or at (831) 566-3256.


 
