Abstract:
|
Expectations are high for individualized risk prediction based on the accumulated knowledge of gene-disease associations. While mega data linking biobanks and electronic health records (EHR) may have sufficient sample size for learning thousands of genetic factors, retrieving exact disease onset information requires labor-intensive chart review, which limits the availability of such gold-standard labels to only a fraction of the subjects. In this paper, we develop a semi-supervised learning (SSL) method for prediction and inference of individual risk when the number of covariates far exceeds the number of labels, without the typical sparsity assumption on the prediction model. We leverage the predictive power of a few surrogates for the missing labels so that we may predict individual risk involving a number of parameters up to the full sample size. Through a one-step bias correction with a novel cross-fitting scheme, we are able to produce honest SSL confidence intervals for individual risk with arbitrary loading. We demonstrate in simulations the superiority of our SSL approach over existing supervised methods. We apply the method to predict individual risk of obesity using SNPs.
|