Data Preparation: Time to Smarten Up Your Data
Transformations to key predictors and variables in your data add accuracy and context to your predictive model.
What is Data Preparation and How Can it Help?
Data preparation is the manipulation of data into a form suitable for further processing and analysis. It’s a demanding and labor-intensive stage that involves many different kinds of tasks and cannot be fully automated. It’s estimated that data preparation accounts for 60%-80% of the time spent on data mining projects.
The technical preparation of the data is only part of this phase. Obviously, it's mandatory to have clean, well-formed data – to arrange the data in consistent formats, to fill in missing entries, and to reach some minimal level of data quality. Otherwise, it's garbage in, garbage out. But the true significance of data preparation lies in the early insights that will manifest themselves in smarter, more contextual model results. Without careful, thoughtful analysis, data preparation delivers only a fraction of its potential value.
Realizing Great Potential from Data Prep
One of the key contributions that can be made to the data during the preparation phase is smart transformations to the key predictors (independent variables). We sometimes forget that we want to exhaust the potential of the variables. This is a golden opportunity to create value, and if we fail to see this, we might miss highly important aspects, and also lose meaningful prediction power.
In their default form, many of our variables have some given, limited level of information within them. Our job is to make them more informative. In other words, we want to squeeze more information out of them. We want to help them tell the story they want to tell, but sometimes have a limited ability to do so on their own.
One common way to achieve this is by using data transformations. Transformations are a very effective instrument by which we can help the data become more informative and better tell its story.
I'll try to illustrate this point using an e-commerce example. We often take a customer's number of orders as a predictor of lifetime value (LTV), but not every additional order carries the same weight: the gap between a customer's first and second orders is a far more critical signal of future activity than the gap between, say, orders 5 and 6, or 11 and 12. The absolute difference in each case is the same (one order), yet the first tells us much more than the others.
In this case, we can make the data more effective by applying a logarithmic transformation to the field: the log scale stretches the differences between early orders, which matter most, and compresses the differences between later ones.
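A minimal sketch of this idea, using Python's standard library (the order counts are illustrative, not real data):

```python
import math

# Hypothetical order counts for a handful of customers.
order_counts = [1, 2, 5, 6, 11, 12]

# log1p (log(1 + x)) compresses the scale so that early increments
# carry more weight than later ones.
log_counts = [math.log1p(n) for n in order_counts]

# On the raw scale, 1 -> 2 and 11 -> 12 are the same step (one order).
# On the log scale, the early step is several times larger.
step_1_to_2 = math.log1p(2) - math.log1p(1)
step_11_to_12 = math.log1p(12) - math.log1p(11)
```

Any model fed `log_counts` instead of `order_counts` now "sees" the first few orders as the big events they really are.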
Moreover, just because a variable naturally has some specific scale (such as dollars or days), that doesn't mean we should stick with that scale, or that it is necessarily the most informative one. I often find that common sense and a basic understanding of the business can be translated into simple arithmetic operations that eventually make a big difference.
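For instance, raw fields in dollars and days can be recombined into business-shaped ratios. The field names below are assumptions for illustration only:

```python
# Hypothetical customer record; field names and values are illustrative.
customer = {
    "total_spend_usd": 480.0,
    "orders": 8,
    "days_since_first_order": 120,
}

# Two lines of arithmetic turn raw scales into more informative features:
# average order value (dollars per order) and order rate (orders per month).
avg_order_value = customer["total_spend_usd"] / customer["orders"]
orders_per_month = customer["orders"] / (customer["days_since_first_order"] / 30)
```

Neither derived feature exists in the raw data, yet both often predict future behavior better than the original dollar and day columns.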
Another example: we can combine several variables that have different scales into one factor that reflects the value of all of them. A classic case of this is RFM (recency, frequency, monetary) segmentation, where we combine three continuous variables with different scales into one discrete variable, yielding greater efficiency and less redundancy.
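A toy sketch of RFM scoring, using simple rank-based binning (the customer data and the 1–3 scoring scheme are illustrative assumptions; real implementations typically use quantiles over thousands of customers):

```python
def rfm_score(customers):
    """Combine three differently scaled variables into one discrete RFM code."""
    ids = list(customers)

    def ranks(key, lower_is_better=False):
        # Rank each customer 1..n on this variable, where n is the best score.
        ordered = sorted(ids, key=lambda c: customers[c][key],
                         reverse=lower_is_better)
        return {c: i + 1 for i, c in enumerate(ordered)}

    r = ranks("recency_days", lower_is_better=True)  # bought recently = better
    f = ranks("frequency")                           # more orders = better
    m = ranks("monetary")                            # more spend = better
    return {c: f"{r[c]}{f[c]}{m[c]}" for c in ids}

# Hypothetical customers with three continuous variables on different scales.
customers = {
    "a": {"recency_days": 5,   "frequency": 12, "monetary": 900.0},
    "b": {"recency_days": 40,  "frequency": 3,  "monetary": 150.0},
    "c": {"recency_days": 200, "frequency": 1,  "monetary": 20.0},
}

scores = rfm_score(customers)  # e.g. customer "a" -> "333" (best on all three)
```

Three incommensurable scales (days, counts, dollars) collapse into one compact, comparable code.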
In all cases, the key is to step back from the data, looking at the big picture and integrating business considerations into the data. A well-known saying is that, “although we often hear our data speaking, its voice can be soft.” Smart transformation during data prep can make your data’s voice loud and clear.
Room of Influence
To sum up, data prep shouldn't be viewed as a burden, but rather as an essential stage and a golden opportunity for smartening up your data. Beyond the technical aspects, which are mandatory, an important challenge is to make the data as valuable and informative as possible. This is where the data scientist has considerable room for influence, and where their abilities are put to the test.
Do this process well and you can save time and reduce model complexity later. Often, good transformations that rely on a solid understanding of the data will enable you to use simpler models to solve a given problem, which will save many iterations and minor calibrations down the road. So, invest time and energy here, and it will pay big dividends later!
A version of this article appears at CIOReview.