Abstract:
|
Semi-supervised settings are common in today’s massive data repositories. A fundamental challenge is the imbalance between the size of the fully observed data, n, and the size of the data with missing outcomes, the latter being significantly larger. It is commonly assumed that additional information should lead to improved inference; in a semi-supervised setting, however, it is unclear to what extent this holds. We show that root-n inference for the mean of the outcome is possible while requiring only consistent estimation of the outcome model, possibly at a rate slower than root-n. This approach is especially suited to models that do not naturally admit root-n consistency, such as high-dimensional, nonparametric, or semiparametric models. The proposed estimator relies on a novel k-fold cross-fitting scheme and establishes connections between double robustness and semi-supervised learning.
|
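The abstract does not spell out the estimator, so the following is only a minimal sketch of a generic cross-fitted estimate of the outcome mean, not the authors' exact construction. It assumes a scalar covariate and uses a least-squares line as a stand-in for the outcome model; the names `fit_line` and `cross_fit_mean` are hypothetical. For each fold, the outcome model is fitted on the remaining labeled data, averaged over the unlabeled covariates, and then corrected by the mean residual on the held-out labeled fold.

```python
import random
from statistics import fmean


def fit_line(xs, ys):
    """Placeholder outcome model (an assumption): least-squares line y = a + b * x."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def cross_fit_mean(x_lab, y_lab, x_unlab, k=5, fit=fit_line, seed=0):
    """Cross-fitted estimate of E[Y] from labeled pairs and unlabeled covariates."""
    idx = list(range(len(y_lab)))
    random.Random(seed).shuffle(idx)
    # Split the shuffled labeled indices into k roughly equal folds.
    folds = [idx[j::k] for j in range(k)]
    fold_estimates = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Fit the outcome model without the held-out fold.
        predict = fit([x_lab[i] for i in train], [y_lab[i] for i in train])
        # Plug-in term: average prediction over the large unlabeled sample.
        plug_in = fmean(predict(x) for x in x_unlab)
        # Bias correction: mean residual on the held-out labeled fold.
        correction = fmean(y_lab[i] - predict(x_lab[i]) for i in held_out)
        fold_estimates.append(plug_in + correction)
    return fmean(fold_estimates)
```

Any consistent learner could in principle be passed as `fit` in place of the line fit; keeping the residual correction on a fold the model never saw is what makes the construction cross-fitted.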