Online Program

Saturday, May 19
Data Science
Data Science in Practice
Sat, May 19, 10:30 AM - 12:00 PM
Grand Ballroom G

The SOBER Algorithm: How to Squeeze Out Huge but Sparse Data for Making Individual Predictions (304548)

Philipp Gaffert, GfK SE 
*Barbara Hildegard Wolf, GfK SE 

Keywords: Prediction, Online Advertisement, Online Usage Data, Sparsity, Predictive Dictionary

Online advertisers’ business is to present ads to predefined target groups in order to maximize the advertising effect. The key is to know who is sitting behind the screen. Due to the anonymity of the internet, the challenge is to populate cookies with user information such as gender, media consumption habits, or (offline) purchase behavior.

This enrichment process relies heavily on machine-learning algorithms that provide accurate individual predictions (as in Bock & Poel 2009). The predictor variables are derived from online behavior data and consist of visits to a large number of websites. Because most websites’ reach is tiny, the predictor data are very sparse.
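To make the sparsity concrete, here is a minimal sketch (all numbers are hypothetical, not taken from the GfK data) of such a user-by-website visit matrix. Instead of a dense table with one cell per user-website pair, only the indices of the sites each user actually visited are kept:

```python
import random

random.seed(0)
n_users, n_sites = 1_000, 50_000
visits_per_user = 40  # each user visits only a handful of the 50k sites

# Dense storage would need n_users * n_sites cells; instead keep, per user,
# only the indices of the websites that user actually visited.
visits = [sorted(random.sample(range(n_sites), visits_per_user))
          for _ in range(n_users)]

density = sum(len(v) for v in visits) / (n_users * n_sites)
print(f"{density:.4%} of cells are nonzero")  # prints: 0.0800% of cells are nonzero
```

At this density, well under one cell in a thousand is nonzero, which is why dense-matrix methods waste both memory and runtime on such data.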

This extreme sparsity, combined with the large number of potential predictors, causes established approaches to struggle. Algorithms like random forests suffer from long runtimes and poor performance. To avoid these issues, a smaller number of features is often extracted from the broad variety of available information. The big drawback is that a huge fraction of valuable information remains unused: Murray & Durell 2000 found that using a larger number of websites increases accuracy substantially.

As a way out of this dilemma, we extended the ideas of Hu et al. 2007 into the machine-learning algorithm SOBER (Squeezing Online BEhavioR). In this talk we explain the mechanics of the algorithm and present its performance relative to established alternatives. Our validation data set stems from the market research company GfK and comprises the online behavior data of more than 15,000 persons in Germany. We even show how to use websites as predictors that are never observed in the training data, by making use of the words shown on those websites.
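The abstract does not detail how SOBER handles unseen websites, but the word-based idea can be illustrated with a simple sketch (site names and words are hypothetical, and this is not the authors' implementation): each website is represented by the words it displays, so a site never observed in training still maps into a feature space learned from the training sites.

```python
# Hypothetical words scraped from websites seen during training.
train_site_texts = {
    "cars24.example": "car engine horsepower diesel",
    "beautyblog.example": "makeup lipstick skincare",
}

# Build a word vocabulary from the training sites.
vocab = sorted({w for text in train_site_texts.values() for w in text.split()})

def word_features(text):
    """Map any website's visible words onto the training vocabulary."""
    words = set(text.split())
    return [1 if w in words else 0 for w in vocab]

# A website absent from the training data still gets a usable representation;
# unknown words ("tuning") simply fall outside the vocabulary.
print(word_features("car diesel tuning"))  # -> [1, 1, 0, 0, 0, 0, 0]
```

The design choice here is that predictions depend on word features rather than on website identities, so any website with visible text can serve as a predictor.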

In conclusion, we find that SOBER reaches the same level of accuracy as commonly used machine-learning methods while being tremendously faster. This makes the typical runtime-versus-quality trade-off disappear.