Abstract:

We consider the regression of a binary outcome Y on a set of (possibly high dimensional) covariates X based on a large unlabeled dataset D with observations only for X and, additionally, a 'surrogate' S which, while not strongly predictive of Y throughout its support, can predict it with high accuracy when it takes extreme values. Such data arise naturally in settings where Y, unlike (X, S), is difficult to obtain, a frequent scenario in modern studies involving large databases such as electronic medical records (EMR), where an example of (Y, S) is (a disease outcome, its diagnostic codes). Assuming Y and S both follow flexible single index models in X, we show that under sparsity assumptions, we can recover the regression parameter of Y versus X by simply fitting a least squares LASSO to the subset of D lying in the extreme sets of S, with Y imputed using the surrogacy of S. We obtain sharp finite sample performance guarantees for our estimator, with several interesting implications. We demonstrate the effectiveness of our approach through extensive simulations, where it performs as well as or better than supervised methods based on as many as 500 labeled observations, followed by an application to a real EMR dataset.
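To illustrate the core idea, the sketch below simulates a single index setup, restricts to the extreme sets of S, imputes Y from which extreme set each observation falls in, and fits a LASSO on that imputed subset. This is a minimal sketch, not the paper's estimator: the quantile cutoffs, the simple coordinate-descent solver, and all variable names are our own assumptions for illustration.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Plain coordinate-descent LASSO for (1/2n)||y - Xb||^2 + lam*||b||_1.
    An illustrative stand-in for any off-the-shelf LASSO solver."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual excluding coordinate j
            r = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r
            # soft-thresholding update
            beta[j] = np.sign(rho) * max(abs(rho) - lam * n, 0.0) / col_sq[j]
    return beta

rng = np.random.default_rng(0)
n, p = 5000, 20
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]          # sparse signal: 3 active covariates
X = rng.normal(size=(n, p))
lin = X @ beta_true                        # shared single index

# Surrogate S: noisy function of the same index; informative about Y
# mainly when it takes extreme values.
S = lin + rng.normal(scale=2.0, size=n)

# Extreme sets of S (assumed cutoffs: bottom and top 10% quantiles).
lo, hi = np.quantile(S, [0.1, 0.9])
mask = (S <= lo) | (S >= hi)

# Impute Y = 1 on the upper extreme set, Y = 0 on the lower, then fit
# a least squares LASSO of the imputed Y on X over that subset only.
y_imp = (S[mask] >= hi).astype(float)
beta_hat = lasso_cd(X[mask], y_imp, lam=0.01)
```

Because Y follows a single index model, beta_hat recovers the direction of beta_true only up to scale; in this simulation the largest fitted coefficients (in absolute value) land on the three active covariates.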
