Abstract:
|
In studying machine learning classifiers, researchers often assume that training and testing data are sampled at random from the same distribution. One way this assumption fails in practice is that training samples are biased, so the training data are drawn from a conditional distribution rather than the true joint distribution of the data and label. In this paper, we consider the case of a call center where we are only able to collect the label when a customer contacts us. This leads to a biased sampling model in which sampling depends on the feature data only for positive labels. This sampling model also applies to survey statistics, and in particular to data generated by voter surveys. By identifying a formal model for the sampling bias, we prove a generalization bound on the empirical risk of the optimal classifier trained on the sampling distribution. We characterize the tightness of this bound in terms of two quantities: the level of dependency between the sampling and the label, and the empirical risk of the optimal classifier on the full distribution.
|