Abstract:
|
Labeling patients in electronic health records (EHRs) according to whether they have a disease or condition relies on prediction models that use high-dimensional variables derived from EHR data. However, the most readily accessible annotations from EHRs are an incomplete set of gold-standard cases and non-gold-standard cases. We analyze "positive-only" data, in which the binary outcome is not observed directly; instead, an anchor variable is observed as a proxy for the outcome. A positive anchor variable indicates presence of the phenotype, whereas a negative one does not determine the true phenotype status. We use high-dimensional logistic regression models for the gold-standard outcome and introduce a probability model linking the outcome and the anchor variable. We propose a bias-corrected estimator for the case probability and establish asymptotic normality of the proposed estimator. Our method requires sparsity conditions neither on the loading vector nor on the precision matrix of the random design. We validate our theoretical findings through simulations and a real-data example.
|