Keywords: Binary classification, Bias, Maximum Likelihood, Bayes' Theorem
An important property of methods applied in (official) statistics is that they provide unbiased estimators. This is, however, not always the case for classification methods, especially when one wants to determine the (development of the) proportion of positives in an unknown population. In such cases, bias can affect the target variable tremendously. Classification models are particularly sensitive to bias when the scores of the positives and the scores of the negatives overlap, and in many applications, for example in Natural Language Processing and Image Processing, no perfect separation between the two classes is found. Standard bias correction methods assume that information on the bias in the data set is available; in the particular case of an annotated data set, one can inspect the confusion matrix or the ROC curve. Here, a maximum likelihood method is introduced that, based on unlabeled data, correctly estimates the true proportion of positives by fitting the distributions of positive and negative items to the probability scores provided by the model. It does so even when the score distributions of the positives and the negatives overlap. In the paper, the method is explained and then applied to two data sets: a simulated data set and the Banknote Authentication data set. The results show that the maximum likelihood method developed gives good, unbiased estimates of the proportion of positives in binary classification problems. In the future, the method could be extended to multi-class classification. Since this requires finding a maximum in a multidimensional likelihood space, other techniques, such as Markov Chain Monte Carlo sampling, may need to be incorporated.
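The core idea can be sketched in code. This is a minimal illustration, not the paper's implementation: it assumes classifier scores follow known Beta densities per class (hypothetical choices for this sketch, with deliberate overlap), draws an unlabeled sample with a true positive proportion of 0.3, and then maximizes the likelihood of the two-component mixture over that proportion. A naive count of scores above a 0.5 threshold is included to show the bias the abstract refers to.

```python
import numpy as np
from scipy.stats import beta
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)

# Hypothetical score densities for this sketch (not from the paper):
# scores lie in [0, 1] and the two class distributions overlap.
neg_dist = beta(2, 5)   # negatives concentrated at low scores
pos_dist = beta(5, 2)   # positives concentrated at high scores

true_alpha = 0.3        # true proportion of positives in the population
n = 10_000
labels = rng.random(n) < true_alpha
scores = np.where(labels,
                  pos_dist.rvs(n, random_state=rng),
                  neg_dist.rvs(n, random_state=rng))

# Naive plug-in estimate: classify at a 0.5 threshold and count positives.
# Because the score distributions overlap, this estimate is biased.
naive_alpha = np.mean(scores > 0.5)

# Maximum likelihood estimate of alpha from the UNLABELED scores only,
# assuming the per-class score densities are known (e.g. fitted beforehand
# on an annotated data set). The mixture density of a score s is
#   alpha * f_pos(s) + (1 - alpha) * f_neg(s).
def neg_log_lik(alpha):
    mix = alpha * pos_dist.pdf(scores) + (1 - alpha) * neg_dist.pdf(scores)
    return -np.sum(np.log(mix))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
ml_alpha = res.x

print(f"true alpha:     {true_alpha:.3f}")
print(f"naive estimate: {naive_alpha:.3f}")
print(f"ML estimate:    {ml_alpha:.3f}")
```

With these assumed densities the naive threshold count lands noticeably above the true proportion, while the one-dimensional likelihood maximization recovers it closely; the multi-class extension mentioned above would replace this scalar search with a search over a vector of class proportions.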