Statistical Significance and Credibility in Marketing
Statistical credibility indicates how likely it is that a marketing campaign was directly responsible for its recipients’ behavior.
What is Statistical Significance?
When analyzing the results of marketing campaigns, statistical significance is a probabilistic indication of whether the observed campaign results would likely have occurred even in the absence of the campaign.
Stated another way, statistical significance in campaign analysis indicates whether the campaign recipients' behavior was the direct result of a specific campaign, or whether similar results might have been observed even if the campaign had never been run.
What is Statistical Credibility?
Like statistical significance, statistical credibility indicates the degree of confidence in the uplift calculated for a campaign (or a series of campaigns).
When a campaign's calculated uplift is determined to be statistically credible, there exists strong evidence that the campaign was responsible for the increase in spend (or any other uplift metric analyzed). However, if an uplift result is deemed not statistically credible, the marketer should not rely on it for decision-making. Instead, the marketer should engage in additional experimentation (for example, by making changes to the campaign or fine-tuning the recipient groups) with the goal of achieving satisfactory uplift results and statistical credibility.
What is the Difference between Statistical Significance and Statistical Credibility?
Contemporary statisticians prefer the Bayesian inference used to determine statistical credibility to the older frequentist approaches used to determine statistical significance. The former has been made possible over the past few decades with the advent of cheap and efficient computing power. The Bayesian notion of a credible interval is intuitively close to the classic notion of a confidence interval, despite their (credible and significant) philosophical differences.
In practice, Bayesian and frequentist conclusions almost always coincide for campaign analysis when there is enough data. However, when the number of data points is limited – as is often the case in the real world – the Bayesian statistical credibility approach enjoys several advantages, helping marketers make better decisions.
Calculating Uplift using Test and Control Groups
The most reliable way to measure campaign effectiveness is to split the campaign's target audience into two separate groups and to compare the resulting behavior of each one: a test group (those customers who actually received the campaign) and a control group (customers similar to those in the test group, but who received no campaigns during the campaign measurement period).
The goal is to understand how much impact the campaign had on any particular uplift metric (such as an increase in the amount that customers spent), by analyzing the differences in behavior between the test and control group.
However, the resulting uplift calculation may or may not be a reliable indicator of the impact of the campaign itself. To determine how likely it is that the calculated uplift was, in fact, a direct result of the campaign, the statistical credibility of the result must be calculated.
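As a minimal sketch, the raw test-vs-control uplift comparison can be computed as follows. All spend figures here are invented for illustration; this is not real campaign data.

```python
import statistics

# Hypothetical per-customer spend during the measurement period
# (all figures invented for illustration; 0.0 means no purchase).
test_spend = [0.0, 12.5, 0.0, 30.0, 22.0, 0.0, 18.0, 25.5]   # received the campaign
control_spend = [0.0, 0.0, 10.0, 0.0, 15.0, 12.0, 0.0, 0.0]  # received no campaign

test_mean = statistics.mean(test_spend)        # average spend, test group
control_mean = statistics.mean(control_spend)  # average spend, control group

# Relative uplift in average spend between the two groups
uplift = (test_mean - control_mean) / control_mean
print(f"Test mean: {test_mean}, control mean: {control_mean}, uplift: {uplift:.1%}")
```

Note that this raw uplift number, on its own, says nothing about whether the difference is statistically credible – that is what the rest of this section addresses.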
Calculating Statistical Credibility for Customer Marketing Campaigns
There are various techniques for measuring campaign effectiveness in terms of uplift and statistical credibility. The following describes one approach (which is the one used by Optimove's CDP software).
When conducting campaign analysis, this approach considers two distinct performance factors:
- Response Rate: the proportion of customers who contributed to the uplift metric, out of the total number of customers in the group (e.g., those who made a transaction at any time during the measurement period)
- Average Value: the average contribution to the uplift metric by customers in the group (e.g., average order value of all customers in the group who placed an order during the measurement period)
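The two factors can be computed per group as in this short sketch (the spend values are invented; 0.0 denotes a customer who made no transaction):

```python
# Hypothetical spend per customer in one group; 0.0 means no transaction.
group_spend = [0.0, 12.5, 0.0, 30.0, 22.0, 0.0, 18.0, 25.5]

responders = [s for s in group_spend if s > 0]

# Response Rate: responders out of all customers in the group
response_rate = len(responders) / len(group_spend)

# Average Value: mean contribution among responders only
average_value = sum(responders) / len(responders)

print(f"Response Rate: {response_rate:.1%}, Average Value: {average_value:.2f}")
```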
Many campaigns drive independent, or even opposing, effects on these two factors. For example, a deep discount on a specific offering is likely to increase the number of conversions, while simultaneously reducing the average revenue per transaction. In terms of the factors we're discussing, such a campaign improves Response Rate among the test group's members (compared with members of the control group), but decreases the Average Value.
Calculating the campaign's uplift in terms of the combined effect of both factors works like this:
For each of the two factors, use the Bayesian Monte Carlo method to calculate the probability that the performance of the test group is better than that of the control group. Probabilities above 95% or below 5% are considered an indication that the observed difference in performance between the test and control groups is indeed due to the campaign, so such a difference should be considered a statistically credible uplift calculation. On the other hand, less conclusive probabilities suggest that the difference in performance cannot necessarily be attributed to the campaign – rather, there is a non-negligible chance that the observed difference is "random." In other words, a difference of that magnitude might have been observed when comparing any two arbitrary groups of customers, so it should not be considered a statistically credible uplift calculation.
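A minimal sketch of such a Monte Carlo comparison for the Response Rate factor, assuming a simple Beta-Binomial model with uniform priors (the counts are invented, and Optimove's actual computation may differ in its details):

```python
import random

random.seed(42)  # reproducible draws

# Hypothetical campaign counts (invented for illustration).
test_n, test_conversions = 400, 72        # observed response rate: 18%
control_n, control_conversions = 400, 48  # observed response rate: 12%

def posterior_draw(conversions, n):
    """Draw a plausible underlying response rate: Beta(1,1) prior updated with the data."""
    return random.betavariate(1 + conversions, 1 + n - conversions)

# Monte Carlo estimate of P(test group's true response rate > control group's)
draws = 100_000
wins = sum(
    posterior_draw(test_conversions, test_n)
    > posterior_draw(control_conversions, control_n)
    for _ in range(draws)
)
probability = wins / draws

credible = probability > 0.95 or probability < 0.05
print(f"P(test > control) = {probability:.3f}, credible: {credible}")
```

With these particular counts, the probability lands well above 95%, so the observed 18% vs. 12% difference would be flagged as a statistically credible positive uplift in Response Rate.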
A credible positive uplift means that the campaign was effective, successfully influencing customers towards desirable behaviors. Similarly, a credible negative uplift means that the campaign was detrimental, driving customers into undesirable behaviors. Naturally, for lower-is-better types of metrics (such as churn rate, cancellation rate and return rate), the opposite conclusions hold. Uplifts that are not credible – whether positive or negative – simply mean that there is not enough evidence to reach a definitive conclusion either way.
There are three major factors that influence these probabilities:
- The sample size: A larger number of customers leads to higher precision (lower variance) in the distributions of the Response Rate and Average Value for each group. This, in turn, pushes the probability that one group outperforms the other away from 50% (equal chances of either one being better than the other), and towards the credibility thresholds of 95% (test group is better) or 5% (control group is better).
- The variance of observed values: When values are spread out over large ranges, proper statistical inference must allow for wider intervals of possible underlying means, pushing the probability of one group outperforming the other away from the credible threshold, and closer to the 50% midpoint.
- The magnitude of the difference: With more dramatic differences in performance between the test and control groups (in Response Rate, Average Value, or both), the probabilities get pushed towards the credibility thresholds, allowing statistically credible conclusions to be reached, even with fewer data points and/or higher variances.
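The sample-size effect in particular is easy to demonstrate: holding the observed response rates fixed and growing the group sizes pushes the probability away from 50% and toward the credibility threshold. A sketch assuming a simple Beta-Binomial model with uniform priors (all numbers invented):

```python
import random

random.seed(0)

def prob_test_beats_control(n, test_rate, control_rate, draws=50_000):
    """Monte Carlo P(test > control) under Beta(1,1) priors, given observed rates at size n."""
    t_conv, c_conv = round(n * test_rate), round(n * control_rate)
    wins = sum(
        random.betavariate(1 + t_conv, 1 + n - t_conv)
        > random.betavariate(1 + c_conv, 1 + n - c_conv)
        for _ in range(draws)
    )
    return wins / draws

# The same observed 15% vs. 12% response rates, at growing group sizes:
for n in (100, 1_000, 10_000):
    print(f"n = {n:>6}: P(test > control) = {prob_test_beats_control(n, 0.15, 0.12):.3f}")
```

With 100 customers per group the probability falls well short of the 95% threshold; with 10,000 it is effectively certain – even though the observed rates never changed.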
Frequently Asked Questions about Measuring Marketing Campaign Effectiveness
Optimove users are a curious bunch, and because the Optimove software reports on the uplift and statistical credibility of the campaigns it manages, Optimove users often seek to understand how to make the best use of these results when measuring marketing effectiveness. Here are answers to some questions we've received on the topic of statistical credibility. These answers will help clarify the practical implications of the statistical marketing concepts discussed above.
Q: Is the lack of statistical credibility in our campaign results due to the small group sizes of our target groups? Perhaps I should only consider the results of the analysis of a recurring campaign series? Or, should I increase the number of customers in each individual campaign in an attempt to make them statistically credible?
There is no clear-cut answer to these questions, mainly because there are various reasons why a result may not be credible. As mentioned above, there are three major factors in play: the total number of customers targeted, the difference in response patterns between the test and control groups, and the variance (how "noisy" the results are). There is no particular threshold for each factor above which a campaign becomes credible. Rather, it is the interaction of the three that determines the probability of the test group actually outperforming the control group (or vice versa). Moreover, the time period during which a recurring campaign is run is not, in itself, relevant to its likelihood of becoming credible, since the time period does not necessarily correspond to the actual number of customers targeted (although in many cases, looking at longer time windows does allow meaningful analysis when the target groups are small).
It is important to keep in mind that campaign results might not be statistically credible because the campaign itself is simply not effective! If a campaign is, in fact, not successfully influencing customers to behave one way or another, increasing the number of recipients will obviously not lead to statistically credible results.
It is also worth mentioning that the flip side of this phenomenon occurs in campaigns with huge sample sizes, say over a million customers. Such campaigns tend to be statistically credible, even with fairly minor test-control differences. In such situations, with such large sample sizes, the results may not be subjectively meaningful – such campaigns are prime candidates for further personalization/granularization.
Q: Regarding group size, I've noticed that, so far, the only time any of my campaigns are statistically credible for an individual campaign is when the number of customers targeted is at least 50. Can you please confirm that this is indeed the case?
There is no specific threshold for the number of recipients needed to achieve statistically credible results, as statistical credibility also depends on the mean and variance of the customers' behavior. For example, take a campaign that is not successful, such that it embodies a "real" test-control response rate difference of, at most, 0.1%. In this case, you will probably need many more than 50 customers to get statistically credible results, as the group size needs to compensate for the campaign's relatively weak impact. However, if the campaign works extremely well and embodies a "real" test-control difference of a whopping 25%, then having even 30 customers is probably enough to achieve statistical credibility when analyzing the campaign results.
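This trade-off between effect size and sample size can be simulated directly. The following sketch assumes a simple Beta-Binomial model with uniform priors and invented counts; it contrasts a dramatic difference measured on tiny groups with a tiny difference measured on much larger ones:

```python
import random

random.seed(7)

def prob_test_beats_control(test_conv, test_n, ctrl_conv, ctrl_n, draws=50_000):
    """Monte Carlo P(test > control) under Beta(1,1) priors for each group."""
    wins = sum(
        random.betavariate(1 + test_conv, 1 + test_n - test_conv)
        > random.betavariate(1 + ctrl_conv, 1 + ctrl_n - ctrl_conv)
        for _ in range(draws)
    )
    return wins / draws

# A dramatic difference: 50% vs. ~23% response rate, only 30 customers per group
strong = prob_test_beats_control(15, 30, 7, 30)

# A tiny difference: 12.1% vs. 12.0%, even with 1,000 customers per group
weak = prob_test_beats_control(121, 1_000, 120, 1_000)

print(f"strong effect, n=30 per group:    {strong:.3f}")
print(f"weak effect, n=1,000 per group:   {weak:.3f}")
```

The dramatic difference clears the 95% threshold with just 30 customers per group, while the 0.1% difference stays near the inconclusive 50% mark even with 1,000.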
Q: I'm keen to use more granular customer target groups, but won't the small group sizes affect my ability to receive statistically credible campaign results?
It is less important to aim for statistical credibility than it is to strive for focused and effective campaigns. In any case, you can always analyze a combined series of small, recurring campaigns to get results for a larger sample size. However, even when doing so, it is possible that there will still not be enough control group customers to attain statistically credible results. One possible solution is to select a higher proportion of recipients as the control group (even up to 50% in extreme cases), for a few campaign runs, to ensure that you have a number of control group customers large enough to get useful results.
The point here is not that statistical credibility isn't important (it's extremely important), but that, in general, you should try to reach statistical credibility through focused and effective campaigns, not by tweaking the number of campaign recipients. Statistical credibility is not an objective in its own right, but an important indication regarding whether or not the campaign analysis results are certain enough to be relied upon (think of it more as the messenger, rather than the message itself).
A good example of a situation in which to prefer a not-so-granular campaign (with a large recipient group) to a small, granular campaign, is when you are unsure how to target that particular customer segment in a granular fashion. If you're unsure how to approach some customer segment, and how to break it down into more granular groups, then starting with a relatively large and heterogeneous group is a solid option. The larger recipient base may enable faster learning and, more importantly, it's better to start with something than to stall and not do anything. However, such a strategy should always be regarded as temporary, keeping in mind that after some learning period, you should be subdividing the group into granular sub-groups in a way that makes business sense (look out for sub-group recommendations made by Optibot, indicated by the lightbulb icon in campaign analysis results).
Q: If striving for statistical credibility shouldn't be the goal, then why can't the marketing plan performance summary give a figure which takes into account all KPI increases and decreases, and not just the statistically credible ones?
If a result is not statistically credible, then that result may be due to nothing other than chance – it tells you little or nothing, and can be misleading. Including such results may lead you to make decisions based on random chance, rather than on the actual performance of your campaigns, a situation that would likely lead to sub-optimal performance of your overall marketing plan.
Q: Can you explain "variance" in more detail?
Variance is a measure of the "spread" of a distribution – roughly, the typical squared distance between the mean of the distribution and data points sampled from that distribution. High variance means that many sampled points end up quite far from the mean, while low variance means that almost all sampled points are very close to the mean. Similarly, the sample variance measures this "spread" in a specific sample – low variance means the sample contains points that are very close together, while high variance means the points are widely spread apart.
In the context of campaign analysis, the underlying distributions for the test and control groups are unknown; the goal is to compare their unknown means based on the data points in a sample, generated by running the campaign. When the sample variance is low, the mean of an underlying distribution is (probabilistically) restricted to a fairly narrow range. When the sample variance is high, the underlying mean may belong to a much wider range. Intuitively, if those unknown means belong (probabilistically) to non-overlapping ranges, then it follows that one is almost certainly larger than the other. The narrower those ranges get, the smaller their overlap. Thus, low sample variance leads to higher credibility – the actual computation is slightly different, but the intuition holds.
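This overlap intuition can be seen in a small bootstrap sketch (all spend values invented): two pairs of groups whose means differ by exactly the same amount, but with very different spreads.

```python
import random
import statistics

random.seed(1)

def prob_mean_greater(a, b, draws=20_000):
    """Bootstrap estimate of P(underlying mean of a > underlying mean of b)."""
    wins = sum(
        statistics.mean(random.choices(a, k=len(a)))
        > statistics.mean(random.choices(b, k=len(b)))
        for _ in range(draws)
    )
    return wins / draws

# Both pairs differ by 5 in mean (25 vs. 20), but with very different spreads:
low_var_test  = [24, 25, 26, 25, 24, 26, 25, 25]   # tightly clustered around 25
low_var_ctrl  = [19, 20, 21, 20, 19, 21, 20, 20]   # tightly clustered around 20
high_var_test = [5, 45, 10, 40, 8, 42, 25, 25]     # mean 25, widely spread
high_var_ctrl = [2, 38, 6, 35, 4, 33, 21, 21]      # mean 20, widely spread

print(f"low variance:  {prob_mean_greater(low_var_test, low_var_ctrl):.3f}")
print(f"high variance: {prob_mean_greater(high_var_test, high_var_ctrl):.3f}")
```

With the same mean difference, the tightly clustered groups yield a near-certain conclusion, while the widely spread groups leave substantial doubt – the same reason low sample variance helps a campaign's uplift reach statistical credibility.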
Last updated May 2019