Abstract:
|
Consider a regression problem in which both the continuous label variable (Y) and the covariate predictors (X) are observed for one part of the data, while only the predictors are observed for the remainder. Such a problem arises, for example, when observations of the label variable are costly and may require a skilled human agent, but observations of the covariate predictors are cheap and plentiful. If the conditional expectation E(Y|X) is exactly linear in X, then the additional observations of X typically contain no useful information; otherwise, the unlabeled data can be informative. We propose estimators that improve on the standard procedures, which depend only on the labeled data. The estimation method is easy to implement, and its asymptotic properties are simple to describe. The new estimators asymptotically dominate the standard procedures. The practical performance of the new estimators is investigated in a simulation study and a real data example. This work builds on the assumption-lean, random-covariate framework of Buja et al. (2014-6) and on earlier work of Zhang, Brown and Cai (2016).
|