Abstract:
|
Analysis of modern biomedical data is often complicated by missing values. When variables of interest are missing for some subjects, it is desirable to use observed auxiliary variables, which are sometimes high-dimensional, to impute or predict the missing values and thereby improve statistical efficiency. Although many methods have been developed for prediction with high-dimensional variables, valid inference based on the predicted values remains challenging. In this paper, we develop a test of association between an outcome variable and a potentially missing covariate, where the covariate can be predicted using variables selected from a set of high-dimensional auxiliary variables. We establish the validity of the test under data-driven model selection procedures. Through extensive simulation studies, we demonstrate the validity of the proposed method and its advantages over existing methods, and we provide an application to a major cancer genomics study.
|