
All Times EDT

Abstract Details

Activity Number: 203 - Contemporary Machine Learning
Type: Contributed
Date/Time: Tuesday, August 4, 2020, 10:00 AM to 2:00 PM
Sponsor: Section on Statistical Learning and Data Science
Abstract #309809
Title: Risk Minimization Under Sampling Bias Arising from Customer Interactions
Author(s): Scott Rome* and Michael Kreisel
Companies: Comcast and Comcast
Keywords: data science; statistical learning; machine learning; sampling bias; risk minimization; biased sample

In studying machine learning classifiers, researchers often assume that training and testing data are sampled at random from the same distribution. In practice, this assumption can fail because training samples are collected in a biased way, yielding training data drawn from a conditional distribution rather than the true joint distribution of the features and label. In this paper, we consider the case of a call center where the label can be collected only when a customer contacts us. This leads to a biased sampling model that depends on the feature data only for positive labels. The same sampling model applies to survey statistics, and in particular to data generated by voter surveys. By identifying a formal model for the sampling bias, we prove a generalization bound on the empirical risk of the optimal classifier trained on the sampling distribution, and we characterize the tightness of this bound in terms of the level of dependence between the sampling mechanism and the label, and the empirical risk of the optimal classifier on the full distribution.
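To make the sampling model concrete, the following is a minimal simulation sketch, not the authors' actual model: labels follow an assumed logistic relationship with a single feature, and an observation enters the training sample only through a hypothetical "contact" event whose probability depends on the feature for positive labels but is constant for negatives. All distributions and parameters here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 100_000
x = rng.normal(size=n)                 # customer feature (assumed scalar)
p_pos = 1.0 / (1.0 + np.exp(-x))       # assumed P(y = 1 | x), logistic
y = rng.binomial(1, p_pos)

# Hypothetical contact mechanism: a positive-label customer contacts the
# call center with a feature-dependent probability, while negatives are
# observed at a constant background rate. This mirrors a sampling model
# that depends on the features only for positive labels.
contact_prob = np.where(y == 1, 1.0 / (1.0 + np.exp(-(x - 1.0))), 0.2)
sampled = rng.binomial(1, contact_prob).astype(bool)

# The biased sample over-represents positives with large x, so both the
# class balance and the feature distribution differ from the population.
pop_rate = y.mean()
sample_rate = y[sampled].mean()
```

Training a classifier on `(x[sampled], y[sampled])` therefore targets the conditional (sampled) distribution rather than the joint distribution, which is the gap the paper's generalization bound quantifies.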

Authors who are presenting talks have a * after their name.

Back to the full JSM 2020 program