Beware the Giraffes in Your Data!
During the past few years, I’ve spent a lot of time working with marketing departments who spend much of their time analyzing data, conducting research and tracking performance metrics. They are always on the lookout for exciting new insights which can translate into action items and provide strategic advantage. Unfortunately, marketers and analysts often miss key opportunities to spot these insights – and even make the wrong decisions – because they fail to account for the “giraffe effect” in their data.
Giraffes are what I call portions of data which dominate the rest of the data – and hide important insights. Sometimes they even lead to wrong conclusions. This can happen in different ways and for different reasons, as we will see below.
The Giraffe, the Fox, the Cat and the Mouse
Let’s say you’re out watching animals in a nature reserve. Undoubtedly, when you spot a majestic giraffe in your binoculars, you’re going to take a good look at him. Meanwhile, many of the other, smaller animals will all just seem, well, small. You won’t notice that there are significant differences in height among the smaller animals, especially as compared to the giraffe.
However, if you can take your eyes off the giraffe for a minute and zoom your binoculars into the smaller animals on the plain, an amazing thing happens: suddenly, you become aware that the differences in size between the animals are actually much larger than you had first realized.
This is a very simple example of the giraffe effect. When people look at a set of data which includes some very large, dominant members, important differences among the other data in the set often disappear from view.
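To put rough numbers on the binoculars analogy, here is a minimal sketch. The heights are invented for illustration, and `relative_to_max` is a hypothetical helper that mimics what a chart with one shared axis does: it scales every value to the largest one in view.

```python
# Hypothetical heights in metres, chosen only to illustrate the effect.
heights_m = {"giraffe": 5.0, "fox": 0.4, "cat": 0.25, "mouse": 0.05}

def relative_to_max(values):
    """Scale each value to the largest one, as a shared chart axis would."""
    top = max(values.values())
    return {name: value / top for name, value in values.items()}

# View 1: everything on one scale, dominated by the giraffe.
with_giraffe = relative_to_max(heights_m)

# View 2: the same small animals after "zooming in" past the giraffe.
small_animals = {k: v for k, v in heights_m.items() if k != "giraffe"}
without_giraffe = relative_to_max(small_animals)

for name in small_animals:
    print(f"{name:6s} with giraffe: {with_giraffe[name]:.3f}   "
          f"without: {without_giraffe[name]:.3f}")
```

With the giraffe in view, the three small animals are squeezed into the bottom tenth of the scale. Once it is removed, the same animals spread across the whole scale and the differences among them become easy to see.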
A Website Analytics Example
If you’re not already familiar with the following images, they are known as “heatmaps.” They reveal the areas of most intensive visitor mouse movements and mouse clicks on a webpage. Red areas indicate the most mouse activity, blue areas the least.
In this first heatmap, we see only one dark red area, namely the login password field. Because many of the visitors to this page are already registered users of the site, it makes sense that such a large percentage of mouse activity is centered on the login area. However, because all the mouse activity data is aggregated here, important information about where non-registered visitors are looking and clicking is hidden from the analyst’s view.
Once the analyst drills down and removes the giraffe from the data (the registered users), he suddenly sees a view of the data that is much more revealing as to the visitors’ areas of interest. Specifically, we see in the following heatmap a dozen red areas instead of only one. By separating out just one portion of the data (the registered users), the analyst uncovers the important information that will lead him to better decisions about how to improve the website.
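The drill-down described above amounts to filtering one segment out before aggregating. The following is a rough sketch only: the click log, the region names, and the `clicks_by_region` helper are all hypothetical, and a real heatmap tool would expose its own segmentation options.

```python
from collections import Counter

# Hypothetical click log: (page region, visitor is a registered user).
clicks = [
    ("login_password", True), ("login_password", True),
    ("login_password", True), ("login_password", True),
    ("login_password", True), ("pricing_link", False),
    ("signup_button", False), ("signup_button", False),
    ("features_tab", False),
]

def clicks_by_region(events, include_registered=True):
    """Count clicks per region, optionally dropping registered users."""
    return Counter(
        region
        for region, registered in events
        if include_registered or not registered
    )

# Aggregated view: the login field dominates everything else.
everyone = clicks_by_region(clicks)

# Segmented view: registered users (the giraffe) removed.
non_registered = clicks_by_region(clicks, include_registered=False)

print(everyone.most_common(3))
print(non_registered.most_common(3))
```

In the aggregated counts the login field outweighs every other region. In the segmented counts it disappears, and the regions that interest non-registered visitors come to the top.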
Now let’s take a look at a more complex example that demonstrates how digging deeper than aggregations can reveal important (and surprising) insights.
A Customer Analytics Example
Since many of our clients are Internet firms, we have come across plenty of examples in this field where giraffes in the data can lead to poor decision making. And it’s not always easy to even know that a giraffe is lurking in your data, leading you astray.
Marketers for a particular iGaming company wanted to improve their customer acquisition efforts by focusing on the most lucrative customer segments. Naturally, one of the dimensions they considered was the gender of their players. A top-level aggregation of their data clearly showed that male players had a 39% higher customer lifetime value (LTV) than female players (the data has been simplified for the sake of this article):
The obvious conclusion from this analysis is to focus more resources on acquiring male players than female players. This, however, would be a mistake, because female players actually have a higher LTV in every country! This becomes clear when the numbers are sliced by country, where female LTV is double male LTV in each case:
More than simply hiding insights, this aggregation actually led to an incorrect conclusion. How can this be? This situation exists because of two factors: the large discrepancy in the number of male/female players in the different countries and the large discrepancy in LTV from country to country. The following table shows the gender breakdown by country (the percentage figures refer to the distribution of customers in each row).
In this case, the UK is a huge giraffe lurking in the data: its players have a much larger LTV than those in RU and the US, and its proportion of male to female players is the reverse of theirs. Because the UK’s high-value players skew male relative to the other countries, they pull the aggregate male LTV above the aggregate female LTV. By drilling down a bit and looking at each country individually, the marketers were able to discover the ideal course of action.
While this kind of situation is admittedly unusual, it is an excellent demonstration of a hard-to-spot giraffe in the data. By the way, the paradoxical situation in which a trend that holds within every subgroup reverses when the data is aggregated, as in this example, is known as Simpson’s paradox.
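The reversal is easy to reproduce with a few lines of arithmetic. The sketch below uses invented figures. They are not the company’s data and do not reproduce the 39% gap quoted above; they only follow the same pattern (one high-LTV country that skews male, two low-LTV countries that skew female, and female LTV double male LTV in each country).

```python
# Invented figures: (country, gender) -> (number of players, average LTV).
segments = {
    ("UK", "male"): (900, 500.0), ("UK", "female"): (100, 1000.0),
    ("RU", "male"): (100, 50.0),  ("RU", "female"): (900, 100.0),
    ("US", "male"): (100, 60.0),  ("US", "female"): (900, 120.0),
}

def aggregate_ltv(gender):
    """Player-weighted average LTV for one gender across all countries."""
    rows = [(n, ltv) for (_, g), (n, ltv) in segments.items() if g == gender]
    total_players = sum(n for n, _ in rows)
    return sum(n * ltv for n, ltv in rows) / total_players

countries = sorted({country for country, _ in segments})

# Within every country, female LTV is higher than male LTV...
female_wins_everywhere = all(
    segments[(c, "female")][1] > segments[(c, "male")][1] for c in countries
)

# ...yet the aggregated averages point the other way.
male_total = aggregate_ltv("male")
female_total = aggregate_ltv("female")

print("Female LTV higher in every country:", female_wins_everywhere)
print(f"Aggregate male LTV:   {male_total:.2f}")
print(f"Aggregate female LTV: {female_total:.2f}")
```

The aggregate is a player-weighted average, so the large group of high-LTV UK males dominates the male figure, while the female figure is dominated by the many low-LTV players in RU and the US.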
In this blog post, I’ve tried to direct your attention to the fact that there are often giraffes in your data. These giraffes can hide important insights and can even lead to erroneous strategic decision making. The handful of examples here are only the tip of the iceberg; there are many more ways that aggregated data can hide insights and mislead marketers and analysts. Here are a few other common giraffes that immediately come to mind, along with how to handle them:
- Understand the true effectiveness of your SEO efforts by eliminating all traffic due to searches which included your brand names.
- Make sure that data on the majority of e-commerce customers – one-time purchasers – is not concealing important insights regarding the more valuable – repeat – customers.
- Make sure that data on the 40% of iGaming players who churn after their first 24 hours is not leading you to incorrect conclusions about where the most valuable players are acquired.
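As one illustration of the first item in the list, branded searches can be filtered out before measuring organic search traffic. This is a minimal sketch under stated assumptions: the brand terms, the query log, and the `is_branded` helper are all hypothetical, and simple substring matching will miss misspellings and may over-match short brand names.

```python
# Hypothetical brand terms and search log: (query, visits).
BRAND_TERMS = ("acme", "acmeshop")

search_visits = [
    ("acme login", 1200),
    ("acmeshop coupons", 300),
    ("buy running shoes online", 90),
    ("best trail shoes", 60),
]

def is_branded(query, brand_terms=BRAND_TERMS):
    """True if the query mentions any brand term (case-insensitive)."""
    q = query.lower()
    return any(term in q for term in brand_terms)

# Keep only the searches that did not include a brand name.
non_branded = [(q, v) for q, v in search_visits if not is_branded(q)]

total_visits = sum(v for _, v in search_visits)
non_branded_visits = sum(v for _, v in non_branded)

print(f"All search visits:         {total_visits}")
print(f"Non-branded search visits: {non_branded_visits}")
```

The same pattern of filtering out the dominant segment and then re-aggregating applies to the other two items, with one-time purchasers or first-day churners in place of branded searches.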
In short, I strongly encourage marketers and analysts to dig down into their data, to look out for misleading dominant portions of the data, and not to rely only on high-level, aggregated views. Beware the giraffes in your data!
A variation of this post appeared on GigaOm on August 24, 2013.
What difference does it make anyway? When the media needs data to back up a point they intentionally throw in the “giraffe” if it helps make the case.
Hi Mike. Yes, you’re absolutely right that it is easy to manipulate data. As the old expression goes: statistics don’t lie, statisticians do! However, the audience of this post is data analysts who want to be able to get the clearest, most objective and most actionable picture from their data. Since “giraffes” can hide important insights (or even lead to erroneous conclusions), my point is that it’s important for anyone analyzing data to seek out any potential giraffes and to remove them.
Good work. Thanks.