203 – Contemporary Machine Learning
Risk Minimization Under Sampling Bias Arising from Customer Interactions
Scott Rome
Comcast
In studying machine learning classifiers, researchers often assume that training and testing data are sampled at random from the same distribution. One way this assumption fails in practice is that training samples are biased, yielding training data drawn from a conditional distribution p(x, y | s = 1) rather than the true distribution p(x, y). In this paper, we consider the case of a call center where we are only able to collect the label y when a customer contacts us. This leads to a biased sampling model which depends on x only when y = 1. This sampling model is applicable to survey statistics and particularly to data generated by voter surveys. By identifying a formal model for the sampling bias, we prove a generalization bound on the empirical risk of the optimal classifier f_s trained on the sampling distribution, and we characterize the tightness of this bound by the level of dependency between s and y and the empirical risk of the optimal classifier f* on the full distribution.
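The sampling model described above can be illustrated with a small simulation. This is a hedged sketch, not the paper's model: the specific feature distribution, the logistic contact probability, and the base contact rate for y = 0 customers are all assumptions chosen only to show how selection (s = 1) that depends on x only when y = 1 distorts the observed training distribution p(x, y | s = 1) relative to p(x, y).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Full distribution p(x, y): a 1-D feature and a label whose
# probability increases with x (assumed for illustration).
x = rng.normal(size=n)
y = (rng.random(n) < 1 / (1 + np.exp(-x))).astype(int)

# Hypothetical selection mechanism: the label is observed only when
# the customer contacts the call center (s = 1). Contact probability
# depends on x only when y = 1; for y = 0 it is a constant base rate.
p_contact = np.where(y == 1, 1 / (1 + np.exp(-2 * x)), 0.05)
s = rng.random(n) < p_contact

# The biased sample p(x, y | s = 1) over-represents y = 1 and shifts
# the feature distribution relative to the full population.
print("P(y=1) full   :", y.mean())
print("P(y=1) sampled:", y[s].mean())
print("E[x]  full    :", x.mean())
print("E[x]  sampled :", x[s].mean())
```

Under these assumptions, the sampled data shows a much higher positive rate and a shifted feature mean, which is exactly the gap between training on p(x, y | s = 1) and evaluating on p(x, y) that the abstract's generalization bound addresses.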