Statistical Significance and Credibility in Marketing
Statistical significance indicates how likely it is that a marketing campaign, rather than chance, was responsible for its recipients’ behavior.
What is Statistical Significance?
When analyzing the results of marketing campaigns, statistical significance is a probabilistic indication of whether or not the observed campaign results would likely have occurred even in the absence of the campaign.
Stated another way, statistical significance in campaign analysis is the measure that indicates whether the campaign recipients’ behavior was the direct result of a specific campaign, or whether similar results might have been observed even had the campaign never been run.
When a campaign’s calculated uplift is determined to be statistically significant, there exists strong evidence that the campaign was responsible for the increase in spend (or any other uplift metric analyzed). However, in the event that an uplift result is deemed to be not statistically significant, then the marketer should not rely on that uplift result for decision-making. Instead, the marketer should engage in additional experimentation (for example, by making changes to the campaign or fine-tuning the recipient groups) with the goal of achieving satisfactory uplift results and statistical significance.
Calculating Uplift using Test and Control Groups
The most reliable way to measure campaign effectiveness is to split the campaign’s target audience into two separate groups and to compare the resulting behavior of each one: a test group (customers who actually receive the campaign) and a control group (customers similar to those in the test group, but who receive no campaigns during the campaign measurement period).
The goal is to understand how much impact the campaign had on any particular uplift metric (such as an increase in the amount that customers spent), by analyzing the differences in behavior between the test and control group.
However, the resulting uplift calculation may or may not be a reliable indicator of the impact of the campaign itself. In order to determine how likely it is that the calculated uplift was, in fact, a direct result of the campaign, the statistical significance of the result must be calculated.
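As a rough illustration of the comparison described above, the following Python sketch computes uplift as the difference between the test and control group averages. The function name, the sample spend figures, and the choice to report both absolute and relative uplift are assumptions made for this example; the text does not specify the exact formula used by Optimove.

```python
def uplift(test_values, control_values):
    """Return (absolute, relative) uplift of the test mean over the control mean.

    Illustrative definition only: absolute uplift is the difference in means,
    relative uplift is that difference divided by the control mean.
    """
    test_mean = sum(test_values) / len(test_values)
    control_mean = sum(control_values) / len(control_values)
    absolute = test_mean - control_mean
    relative = absolute / control_mean if control_mean else float("nan")
    return absolute, relative


# Hypothetical per-customer spend during the measurement period.
test_spend = [120.0, 80.0, 0.0, 200.0]
control_spend = [100.0, 60.0, 0.0, 140.0]

abs_uplift, rel_uplift = uplift(test_spend, control_spend)
print(f"absolute uplift: {abs_uplift:.2f}, relative uplift: {rel_uplift:.1%}")
```

On its own, this number says nothing about reliability, which is why the significance calculation is still needed.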
Calculating Statistical Significance for Customer Marketing Campaigns
There are various techniques for measuring campaign effectiveness in terms of uplift and statistical significance. The following describes the approach used by Optimove’s software.
When conducting campaign analysis, Optimove runs two statistical tests:
- Proportion test: This test compares the average response rates of the test and control groups, i.e., the percentage of each group’s customers who performed some tracked action during the campaign measurement period.
- T-test: This test determines whether the average per-customer metric results observed are statistically different between the test and control groups (e.g., did test group customers exhibit higher average spend amounts as compared with the customers in the control group?).
Optimove calculates a p-value for each of the two tests, and both p-values are then used to determine the tests’ statistical significance (a p-value of 0.05 or less is used to indicate significance). When a campaign is deemed statistically significant, it implies that the campaign results were most probably not due to chance. Statistical significance indicates that the analysis results may be interpreted as a reliable estimate of the “real” effect that the campaign had on its target audience.
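The two tests can be sketched in Python using only the standard library. This is a minimal sketch under stated assumptions, not Optimove’s implementation: the function names and sample data are hypothetical, the p-values are two-sided, and the mean-difference test uses a normal approximation in place of the t distribution (reasonable only for large groups).

```python
from math import sqrt
from statistics import NormalDist, mean, variance

ALPHA = 0.05  # significance threshold named in the text


def two_sided_p(statistic):
    """Two-sided p-value from the standard normal distribution."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(statistic)))


def proportion_test(test_responders, test_size, control_responders, control_size):
    """Two-proportion z-test on response rates, using a pooled rate."""
    rate_test = test_responders / test_size
    rate_control = control_responders / control_size
    pooled = (test_responders + control_responders) / (test_size + control_size)
    std_err = sqrt(pooled * (1.0 - pooled) * (1.0 / test_size + 1.0 / control_size))
    z = (rate_test - rate_control) / std_err
    return z, two_sided_p(z)


def mean_difference_test(test_values, control_values):
    """Welch-style statistic for a difference in per-customer averages.

    The p-value uses a normal approximation; a real t-test would use
    the t distribution, which matters for small groups.
    """
    std_err = sqrt(variance(test_values) / len(test_values)
                   + variance(control_values) / len(control_values))
    t = (mean(test_values) - mean(control_values)) / std_err
    return t, two_sided_p(t)


# Hypothetical campaign: 120 of 1,000 test customers responded vs. 90 of 1,000 controls.
z, p_rate = proportion_test(120, 1000, 90, 1000)
print(f"proportion test: z = {z:.2f}, significant = {p_rate <= ALPHA}")

# Hypothetical per-customer spend (tiny lists, so the approximation is crude here).
t, p_spend = mean_difference_test([10.0, 12.0, 11.0, 13.0, 12.0, 14.0],
                                  [9.0, 10.0, 9.0, 11.0, 10.0, 9.0])
print(f"mean test: t = {t:.2f}, significant = {p_spend <= ALPHA}")
```

In practice, a statistics library would normally be used for the t-test so that small groups are handled with the proper distribution.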
For each of the proportion and T-test statistical tests, three factors determine whether the results were statistically significant (i.e., whether they have a p-value of 0.05 or less):
- Sample size (total number of customers participating in the analysis). When more data points are contained in a dataset, the analysis of the dataset can be considered more reliable.
- Difference between the averages of the test and control groups (average response rate for the proportion test, and average metric value for the T-test). The greater the difference, the more reliable the analysis of the results.
- Standard deviation (the standard deviation of the difference in response rates for the proportion test, and the standard deviation of the difference in average metric for the T-test). The standard deviation is one way of measuring the level of “noise” present in the data (technically it’s a measure of the scattering of the data around the average, so a group of customers all having the exact same metric value will have a standard deviation of zero). When the data is “noisy” (with a high standard deviation), only a conspicuously large difference between the test and control groups would be considered significant (an analogy: in a very noisy environment, only a loud shout will be heard, while a gentle whisper will remain undetected). Although standard deviation is just as important as the first two factors, many analysts tend to overlook it and are left wondering why a seemingly big difference between two averages is not considered statistically significant.
To summarize: the greater the sample size is, and the larger the differences in average results between the test and control groups, the greater the chances that the results will be considered statistically significant. However, as the standard deviation rises, the chances diminish. Remember: a big test-control difference with an equally big standard deviation doesn’t mean much. A campaign that achieves a large test-control difference alongside a relatively low standard deviation will most likely be significant.
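The interplay of the three factors can be seen by varying them one at a time. The sketch below is an illustration under simplifying assumptions that are not from the text: equal group sizes, the same per-customer standard deviation in both groups, a two-sided test, and a normal approximation. All names and numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def p_value(diff, sd, n_per_group):
    """Two-sided p-value for a difference in means (normal approximation).

    diff: test average minus control average
    sd: per-customer standard deviation, assumed equal in both groups
    n_per_group: customers in each group, assumed equal
    """
    std_err = sd * sqrt(2.0 / n_per_group)  # std. deviation of the difference
    z = diff / std_err
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


baseline = p_value(diff=5.0, sd=50.0, n_per_group=500)
bigger_sample = p_value(diff=5.0, sd=50.0, n_per_group=5000)
bigger_diff = p_value(diff=10.0, sd=50.0, n_per_group=500)
noisier = p_value(diff=5.0, sd=100.0, n_per_group=500)

for label, p in [("baseline", baseline), ("10x sample", bigger_sample),
                 ("2x difference", bigger_diff), ("2x std. deviation", noisier)]:
    print(f"{label:>18}: p = {p:.4f}")
```

A larger sample or a larger difference lowers the p-value relative to the baseline, while a higher standard deviation raises it.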
Frequently Asked Questions about Measuring Marketing Campaign Effectiveness
Optimove users are a curious bunch, and because the Optimove software reports on the uplift and statistical significance of the campaigns it manages, Optimove users often seek to understand how to make the best use of these results when measuring marketing effectiveness. Here are answers to some questions we’ve received on the topic of statistical significance. These answers will help clarify the practical implications of the statistical marketing concepts discussed above.
Q: Is the lack of statistical significance in our campaign results due to the small group sizes of our target groups? Perhaps we should only consider the results of the analysis of a recurring campaign series? Or, should we increase the number of customers in each individual campaign to attempt to make them statistically significant?
There is no clear-cut answer to this question, mainly because there are various possible reasons that a result is not significant. The most likely reason that campaign results are not statistically significant is that the campaign itself is simply not effective! If a campaign is not successfully motivating customers, then increasing the number of recipients will obviously not increase the likelihood of seeing statistically significant results.
Statistical significance is affected by three main factors: the total number of customers targeted (which is not the same as how long a recurring campaign has been running; duration in itself is irrelevant), the difference in response patterns between the test and control groups, and the standard deviation (how “noisy” the results dataset is). There is no particular threshold for any one factor above which a campaign becomes significant.
Results from very small groups should be analyzed in aggregate via the recurrence option in order to gain greater statistical power. However, keep in mind that accumulating more and more observations in the hope of getting a statistically significant result may end up being ineffective if the campaign itself is ineffective! So, you should focus on trying to create better campaigns, not inflating group sizes with the hopes of achieving statistical significance.
It is also worth mentioning that the flip side of this phenomenon occurs in campaigns with huge sample sizes, such as those with over a million customers. Such campaigns tend to be statistically significant, even with very unimpressive test-control differences. In these situations, the results may not be interesting from a business standpoint, even though they are more likely to be statistically significant than those of smaller campaigns.
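To make the large-sample effect concrete, the following sketch estimates the smallest response-rate gap that a two-sided proportion test would flag as significant at a given group size. It is an approximation under assumptions not taken from the text: equal group sizes, a hypothetical 10% baseline response rate used for the variance of both groups, and a 0.05 threshold.

```python
from math import sqrt
from statistics import NormalDist


def smallest_significant_gap(base_rate, n_per_group, alpha=0.05):
    """Approximate smallest response-rate gap flagged at level alpha (two-sided)."""
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    std_err = sqrt(2.0 * base_rate * (1.0 - base_rate) / n_per_group)
    return z_crit * std_err


for n in (1_000, 100_000, 1_000_000):
    gap = smallest_significant_gap(0.10, n)
    print(f"{n:>9,} per group: about {gap * 100:.2f} percentage points")
```

With a million customers per group, a gap of under a tenth of a percentage point can already clear the threshold, which is why significance alone does not show that a result matters commercially.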
Q: Are all three factors (group size, difference in response and standard deviation) of equal importance?
It’s hard to rank these factors in terms of relative importance, as it depends on the specific campaign conditions. For example, in the case of a huge target group size, say one million customers, an additional person is of marginal influence, whereas for a tiny group an extra person might be very important.
Q: Regarding group size: I’ve noticed that, so far, the only time an individual campaign is statistically significant is when the number of customers targeted is at least 50. Can you please confirm that this is indeed the case?
There is no specific threshold for the number of recipients needed to achieve statistically significant results, as the statistical significance also depends on the standard deviation and customer behavior.
For example: Say a campaign is not successful, such that it embodies a “real” test-control response rate difference of, at most, 0.1%. In this case, you will probably need many more than 50 customers to get statistically significant results, as the group size needs to compensate for the campaign’s weak performance.
However, if the campaign essentially works extremely well and embodies a “real” test-control difference of a whopping 25%, then 50 customers will probably be enough to achieve statistically significant results.
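The two scenarios above can be compared with a standard sample-size approximation for a two-proportion test. This is a sketch under assumptions that are not in the text: a hypothetical 10% control response rate, a two-sided 0.05 threshold, 80% power, and equal group sizes. The function name is illustrative.

```python
from math import ceil
from statistics import NormalDist


def required_group_size(control_rate, true_difference, alpha=0.05, power=0.80):
    """Approximate customers needed per group to detect a true rate difference."""
    test_rate = control_rate + true_difference
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    variance_sum = (test_rate * (1.0 - test_rate)
                    + control_rate * (1.0 - control_rate))
    return ceil((z_alpha + z_power) ** 2 * variance_sum / true_difference ** 2)


weak = required_group_size(control_rate=0.10, true_difference=0.001)
strong = required_group_size(control_rate=0.10, true_difference=0.25)
print(f"0.1% true difference: about {weak:,} customers per group")
print(f"25% true difference:  about {strong:,} customers per group")
```

Under these assumptions, the weak campaign needs on the order of a million customers per group, while the strong one needs only a few dozen, which is consistent with the answer that no fixed threshold such as 50 exists.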
Q: I’m keen to target my campaigns to smaller, more granular customer clusters, but won’t the small group sizes affect my ability to receive statistically significant campaign results?
It is less important to aim for statistical significance than it is to strive for effective and focused campaigns! In any case, you can always analyze a combined series of small, recurring campaigns to get results for a larger sample size. For example, if you send a particular campaign to 50 new customers every day, you should analyze the series as if it were a single campaign. Optimove’s campaign analysis report enables this approach. So, for example, over a two-week period, this “virtual campaign” would accumulate 700 customers (50 per day for 14 days), which will likely be enough to generate reliable results.
However, even when doing this, it is possible that there will still not be enough control group customers to attain significant results. The solution to this is to select a higher proportion of recipients as the control group for a few campaign runs (even up to 50% in extreme cases) to ensure at least a minimal number of control group customers.
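The pooling idea can be sketched as simple bookkeeping: sum the group sizes and responder counts of each run, then analyze the totals as one campaign. The record type, field names, and daily figures below are hypothetical, and the sketch assumes that the 50 daily customers are split between test and control.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Counts for one run of a recurring campaign (illustrative structure)."""
    test_size: int
    test_responders: int
    control_size: int
    control_responders: int


def pool_runs(runs):
    """Combine recurring runs into one 'virtual campaign' by summing counts."""
    return RunResult(
        test_size=sum(r.test_size for r in runs),
        test_responders=sum(r.test_responders for r in runs),
        control_size=sum(r.control_size for r in runs),
        control_responders=sum(r.control_responders for r in runs),
    )


# 14 daily runs of 50 customers each, with 10% held out as control.
daily_runs = [RunResult(45, 6, 5, 1) for _ in range(14)]
pooled = pool_runs(daily_runs)
print(pooled)  # 700 customers in total, but only 70 of them in the control group
```

With a 10% control share, the pooled control group stays small; a 50% share for the same runs would yield 350 control customers, which illustrates why temporarily raising the control proportion can help.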
The point here is not that statistical significance isn’t important (it’s extremely important), but that, in general, you should try to reach statistical significance through focused and effective campaigns, not by tweaking the number of campaign recipients. Statistical significance is not an objective in and of itself, but something that indicates whether or not the campaign analysis results are certain enough to rely upon (think of it more as the messenger rather than the message itself).
A good case in which to prefer a not-so-granular campaign (with a large recipient group) to a small, granular campaign, is when you are unsure how to target that particular customer segment in a granular fashion. If you’re unsure how to approach some customer segment, and how to break it down into more granular groups, then starting with a relatively large and heterogeneous group is a solid option. The larger recipient base may enable faster learning and, more importantly, it’s better to start with something than to stall and do nothing. However, such a strategy should always be regarded as a first step, keeping in mind that after some learning period, you should be subdividing the group into granular sub-groups in a way that makes business sense.
Q: Can you explain “standard deviation” in more detail?
Standard deviation is calculated from the data (just like the mean or the maximum are), and is a measure of the tendency of the observed data points to not be neatly clustered together, but rather spread out from each other with no common anchor (data outliers, for example, increase the standard deviation). Standard deviation can be thought of as a “noisiness” gauge for data.
Compared with other “noisiness” metrics, it enjoys an interpretability advantage: it serves as the buffer between what can be considered pure chance and potentially having effective results. A test-control difference that doesn’t exceed the standard deviation in value is really something we could have expected to happen anyway, by sheer chance, as if no campaign was ever run. A statistically significant campaign, however, will be one whose results exceed the standard deviation by some non-trivial amount (usually at least 1.7 times the standard deviation), i.e., its results stick out above the natural randomness factor enough to indicate an actual cause-effect relationship.
Here’s an example: Let’s say a campaign’s test group spent an average of $120 during the campaign’s measurement period, and the control group spent an average of $100, yielding a $20 difference. If the calculated standard deviation of that difference were $50, then the result would still be deep within the realm of the data’s natural randomness (because $20 < $50), so the test would not be statistically significant. If, however, the standard deviation were only $5, then our result would far exceed the natural noisiness of the data ($20 is four times $5). This would imply that the campaign generated results that were far better than we would have expected by pure chance. Therefore, the result would be significant.
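The arithmetic of this example can be checked with a short sketch. It treats the standard deviation in the example as the standard deviation of the test-control difference (as in the list of three factors earlier) and assumes a two-sided test with a normal approximation; both choices, and the function name, are assumptions for illustration.

```python
from statistics import NormalDist


def significance_of_difference(difference, sd_of_difference):
    """Return (ratio, two-sided p-value) for a test-control difference."""
    ratio = difference / sd_of_difference  # how many standard deviations
    p = 2.0 * (1.0 - NormalDist().cdf(abs(ratio)))
    return ratio, p


difference = 120.0 - 100.0  # test average minus control average

noisy_ratio, noisy_p = significance_of_difference(difference, 50.0)
quiet_ratio, quiet_p = significance_of_difference(difference, 5.0)

print(f"sd = $50: ratio = {noisy_ratio:.1f}, significant = {noisy_p <= 0.05}")
print(f"sd = $5:  ratio = {quiet_ratio:.1f}, significant = {quiet_p <= 0.05}")
```

The same $20 difference falls short of the threshold in the noisy case and clears it comfortably in the quiet one.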