Abstract:
In semi-supervised settings, the available data consist of a small or moderately sized labeled data set and a much larger unlabeled data set. This scenario is frequent in modern studies involving large databases such as electronic medical records (EMR), where the outcome, unlike the covariates, is often expensive to obtain. Supervised linear regression estimators such as OLS use only the labeled data, so it is of interest to investigate if and when the unlabeled data can be used to improve estimation in the adopted model. We propose a class of Efficient and Adaptive Semi-Supervised Estimators (EASE): 2-step estimators that are more efficient than OLS under model mis-specification and equally (optimally) efficient when the model holds. This adaptive property is crucial for advocating safe use of unlabeled data. Construction of EASE involves a flexible semi-non-parametric imputation step, a follow-up 'refitting' step, and a cross-validation strategy, which together address under-smoothing and over-fitting, issues often encountered in smoothing-based 2-step estimators. We establish our claims through theoretical results, followed by validation through extensive simulations and an application to an EMR study.
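As a rough illustration of the general 2-step idea described above, the following minimal sketch imputes outcomes for the unlabeled data with a cross-fitted smoother, applies a linear 'refitting' correction on the labeled data, and then runs least squares on the imputed unlabeled data. It is not the authors' EASE construction: the Gaussian-kernel Nadaraya-Watson smoother, the fixed bandwidth, the fold-averaging rule, and all function names (`nw_smooth`, `two_step_ss`, and so on) are assumptions made only for this example.

```python
import numpy as np


def add_intercept(X):
    """Prepend a column of ones to the covariate matrix."""
    return np.column_stack([np.ones(len(X)), X])


def ols(X, y):
    """Least-squares coefficients (intercept first) of y on X."""
    return np.linalg.lstsq(add_intercept(X), y, rcond=None)[0]


def nw_smooth(X_train, y_train, X_eval, bandwidth):
    """Nadaraya-Watson regression with a Gaussian kernel (assumed smoother)."""
    d2 = ((X_eval[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-0.5 * d2 / bandwidth**2)
    # Guard against division by zero when all kernel weights underflow.
    return (w @ y_train) / np.maximum(w.sum(axis=1), 1e-300)


def two_step_ss(X_lab, y_lab, X_unlab, bandwidth=0.5, n_folds=5, seed=0):
    """Hypothetical 2-step semi-supervised estimator (illustrative only)."""
    rng = np.random.default_rng(seed)
    n = len(y_lab)
    folds = rng.permutation(n) % n_folds  # balanced random fold labels

    m_lab = np.empty(n)               # out-of-fold predictions, labeled data
    m_unlab = np.zeros(len(X_unlab))  # fold-averaged predictions, unlabeled data
    for k in range(n_folds):
        held = folds == k
        # Step 1: smoother trained without fold k, to limit over-fitting.
        m_lab[held] = nw_smooth(X_lab[~held], y_lab[~held], X_lab[held], bandwidth)
        m_unlab += nw_smooth(X_lab[~held], y_lab[~held], X_unlab, bandwidth) / n_folds

    # Refitting: linear correction of the smoother's residuals on labeled data.
    eta = ols(X_lab, y_lab - m_lab)
    imputed = m_unlab + add_intercept(X_unlab) @ eta

    # Step 2: least squares of the imputed outcomes on the unlabeled covariates.
    return ols(X_unlab, imputed)
```

The function takes a labeled covariate matrix and outcome vector plus an unlabeled covariate matrix, and returns coefficients in the same form as `ols` (intercept first), so the two can be compared directly on simulated data where the linear model is mis-specified.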
Copyright © American Statistical Association.