Abstract:
|
Learning a binary classifier for a given phenotype from patients' Electronic Health Records (EHR) normally requires a curated dataset containing both positive and negative examples of the phenotype. However, for certain phenotypes, positive examples are relatively easy to obtain, while negative examples are either too expensive or infeasible to obtain. In this work, we present a model-based likelihood approach for discriminating cases from non-cases based on their EHR, using a small set of gold-standard positive examples and a large set of unlabeled examples. We use the concept of anchor variables to identify the set of positive examples: observing an anchor variable to be positive implies that the phenotype is present, while observing it to be negative is uninformative about the true phenotype status. We choose logistic regression as our working model for the probability of phenotype presence and provide efficient procedures for estimating the regression parameters. We conduct extensive simulation studies to demonstrate estimation consistency, efficiency, and classification accuracy. We also apply the proposed method to identify patients with primary aldosteronism in UPHS.
|