Posted in

Data Analysis

What is Data Preparation and How Can it Help?

Data preparation is the manipulation of data into a form suitable for further processing and analysis. It’s a demanding and labor-intensive stage that involves many different kinds of tasks and cannot be fully automated. It’s estimated that data preparation accounts for 60%-80% of the time spent on data mining projects.

The technical preparation of the data is only part of this phase. Clean, well-structured data is of course mandatory: the data must be arranged in specific formats, missing entries filled in, and some minimal level of data quality reached. Otherwise, it’s garbage in, garbage out. But the true significance of data preparation lies in the early insights it produces, which later show up as smarter, more contextual model results. Without careful and thoughtful analysis, a data preparation project might as well not occur.

Realizing Great Potential from Data Prep

One of the key contributions we can make during the preparation phase is applying smart transformations to the key predictors (independent variables). We sometimes forget that we want to exhaust the potential of these variables. This is a golden opportunity to create value; if we fail to see it, we may miss highly important aspects of the data and lose meaningful predictive power.

In their default form, many of our variables carry only a limited amount of information. Our job is to make them more informative, to squeeze more information out of them. They have a story to tell, but sometimes a limited ability to tell it on their own, and we want to help them.

One common way to achieve this is by using data transformations. Transformations are a very effective instrument by which we can help the data become more informative and better tell its story.

I’ll try to illustrate this point using an e-commerce example. We often take a customer’s number of orders as a predictor of lifetime value (LTV), but not every additional order carries the same weight. The step from a customer’s first order to the second is a much more critical factor in predicting future activity than the step from, say, order 5 to order 6, or from 11 to 12. The absolute difference in each case is the same (one order), but the first tells us much more than the others.

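In this case, we can get more out of our data by applying a logarithmic transformation to the number-of-orders field. The logarithm stretches differences at the low end of the scale and compresses them at the high end, so the step from one order to two counts for more than the step from 11 to 12.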
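To make the effect concrete, here is a minimal sketch in Python. It assumes the natural logarithm and a plain integer order count; the function names are hypothetical and chosen only for illustration.

```python
import math


def log_orders(order_count: int) -> float:
    """Natural log of a positive order count (an assumed choice of base)."""
    if order_count < 1:
        raise ValueError("order_count must be at least 1")
    return math.log(order_count)


def log_step(lower: int, upper: int) -> float:
    """Size of the step from `lower` to `upper` orders on the log scale."""
    return log_orders(upper) - log_orders(lower)


# On the raw scale every step below equals one order.
# On the log scale the early step is by far the largest.
steps = {
    (1, 2): log_step(1, 2),
    (5, 6): log_step(5, 6),
    (11, 12): log_step(11, 12),
}
for (lower, upper), size in steps.items():
    print(f"orders {lower} -> {upper}: raw step = 1, log step = {size:.3f}")
```

The three log steps come out at roughly 0.69, 0.18 and 0.09, so a model fed the transformed field sees the first-to-second order as several times more significant than the later ones. If the field can contain customers with zero orders, `math.log1p` (the log of one plus the count) is one possible alternative, since the logarithm of zero is undefined.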

Moreover, just because a variable naturally has some specific scale (such as dollars or days), that doesn’t mean that we should stick with that scale, or that it is necessarily the most informative one. I often find that common sense and a basic understanding of the business can be translated into a few simple arithmetic operations that eventually make a big difference.
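As one hedged illustration of such arithmetic, the sketch below rescales raw dollars and days into two ratios: average order value and orders per 30 days. The field names and the choice of ratios are assumptions made for this example, not a prescription.

```python
def average_order_value(total_spend: float, orders: int) -> float:
    """Dollars per order; 0.0 for customers with no orders."""
    return total_spend / orders if orders else 0.0


def orders_per_30_days(orders: int, tenure_days: int) -> float:
    """Order rate normalized to a 30-day window; 0.0 if tenure is not positive."""
    return orders * 30 / tenure_days if tenure_days > 0 else 0.0


# Two hypothetical customers with the same total spend of $600.
aov_a = average_order_value(600.0, 3)    # a few large orders
aov_b = average_order_value(600.0, 12)   # many small orders
rate_a = orders_per_30_days(3, 90)       # 3 orders in 90 days
rate_b = orders_per_30_days(12, 360)     # 12 orders in 360 days

print(aov_a, aov_b)    # the rescaled field separates the two customers
print(rate_a, rate_b)  # while their order rates turn out to be identical
```

In raw dollars the two customers look the same, and in raw order counts one looks four times as active as the other. After rescaling, they differ in order size but not in pace, which may be the more useful picture depending on the business question.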

Another example: we can combine several variables that have different scales into one factor that reflects the value of all of them. A classic case of this is RFM (recency, frequency, monetary) segmentation, where we combine three continuous variables with different scales into one discrete variable, yielding greater efficiency and less redundancy.
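The following is a minimal sketch of one common RFM variant, assuming each of the three variables is scored from 1 to 3 by its rank among all customers and the three scores are concatenated into a single segment code. The `Customer` fields, the number of bins and the sample data are hypothetical; real implementations differ in how they bin and combine the scores.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    name: str
    recency_days: int   # days since last order (lower is better)
    frequency: int      # number of orders (higher is better)
    monetary: float     # total spend (higher is better)


def rank_score(values, value, bins=3, higher_is_better=True):
    """Score `value` from 1 (worst) to `bins` (best) by its rank in `values`."""
    ordered = sorted(values, reverse=not higher_is_better)  # worst first
    position = ordered.index(value)                         # ties share the lower rank
    return min(bins, position * bins // len(ordered) + 1)


def rfm_segments(customers, bins=3):
    """Map each customer name to a discrete code such as '332'."""
    recencies = [c.recency_days for c in customers]
    frequencies = [c.frequency for c in customers]
    monetaries = [c.monetary for c in customers]
    segments = {}
    for c in customers:
        r = rank_score(recencies, c.recency_days, bins, higher_is_better=False)
        f = rank_score(frequencies, c.frequency, bins)
        m = rank_score(monetaries, c.monetary, bins)
        segments[c.name] = f"{r}{f}{m}"
    return segments


# Hypothetical sample data.
customers = [
    Customer("a", recency_days=5, frequency=12, monetary=900.0),
    Customer("b", recency_days=40, frequency=4, monetary=1000.0),
    Customer("c", recency_days=200, frequency=1, monetary=30.0),
]
segments = rfm_segments(customers)
print(segments)
```

Three continuous fields on three different scales (days, counts, dollars) collapse into a single categorical field, which a model or a marketer can use directly.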

In all cases, the key is to step back from the data, look at the big picture, and integrate business considerations into the data. A well-known saying is that, “although we often hear our data speaking, its voice can be soft.” Smart transformation during data prep can make your data’s voice loud and clear.

Room for Influence

To sum up, data prep shouldn’t be viewed as a burden, but rather as an essential stage and a golden opportunity for smartening up your data. Beyond the technical aspects, which are mandatory, an important challenge is to get the data to be as valuable and informative as possible. This is where the data scientist has a great deal of room for influence, and where their abilities are put to the test.

Do this well and you can save time and reduce model complexity later. Often, good transformations that rely on a solid understanding of the data will let you solve a given problem with simpler models, which saves many iterations and minor calibrations down the road. So, invest time and energy here, and it will pay big dividends later!

A version of this article appears at CIOReview.

Published on May 26th, 2016

Yohai Sabag

Yohai heads up Optimove’s data lab. He is a top-tier data guru with extensive experience applying the fields of business intelligence and advanced data analytics to practical business challenges. Yohai holds a master’s degree in machine learning and information systems.

Data Preparation: Time to Smarten Up Your Data

Transformations to key predictors and variables in your data add accuracy and context to your predictive model.
