Abstract Details

Activity Number: 354 - Topics in Machine Learning
Type: Contributed
Date/Time: Tuesday, July 31, 2018 : 10:30 AM to 12:20 PM
Sponsor: Section on Statistical Learning and Data Science
Abstract #329466 Presentation
Title: Model-Based Electronic Health Records Phenotyping from Only Positive and Unlabeled Data
Author(s): Lingjiao Zhang* and Naveen Muthu and Xiruo Ding and Daniel S Herman and Jinbo Chen
Companies: University of Pennsylvania (all authors)
Keywords: EHR; Phenotype; Anchor; positive only; primary aldosteronism; Binary classifier

Learning a binary classifier for a phenotype from patients' electronic health records (EHR) normally requires a curated dataset containing both positive and negative examples of the phenotype. For certain phenotypes, however, positive examples are relatively easy to obtain, while negative examples are too expensive or infeasible to collect. In this work, we present a model-based likelihood approach for discriminating cases from non-cases based on their EHR, using a small set of gold-standard positive examples and a large set of unlabeled examples. We use anchor variables to identify the set of positive examples: observing a positive anchor variable implies that the phenotype is present, whereas observing a negative anchor variable is uninformative about the true phenotype. We choose logistic regression as the working model for the probability of phenotype presence and provide efficient procedures for estimating the regression parameters. We conduct extensive simulation studies to demonstrate estimation consistency, efficiency, and classification accuracy. We also apply the proposed method to identify patients with primary aldosteronism in UPHS.
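
The anchor-variable setup can be illustrated with a minimal sketch. It is not the authors' estimator: it assumes one plausible form of the likelihood, in which the anchor S fires only for true cases with an unknown constant sensitivity c that does not depend on the covariates, so that P(S = 1 | x) = c * expit(x'beta). The names `simulate`, `log_lik`, `gradient`, and `fit`, the single simulated covariate, and the backtracking gradient ascent are all hypothetical choices made for illustration.

```python
import math
import random


def expit(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def simulate(n, beta, c, seed=0):
    """Draw (x, s) pairs; the true phenotype y is never returned."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = (1.0, rng.gauss(0.0, 1.0))  # intercept plus one covariate
        y = rng.random() < expit(beta[0] * x[0] + beta[1] * x[1])
        s = 1 if (y and rng.random() < c) else 0  # anchor fires only for cases
        rows.append((x, s))
    return rows


def log_lik(rows, theta):
    """Observed-data log-likelihood; theta = (beta0, beta1, logit(c))."""
    c = expit(theta[2])
    total = 0.0
    for x, s in rows:
        p = expit(theta[0] * x[0] + theta[1] * x[1])  # P(Y = 1 | x)
        q = max(min(c * p, 1.0 - 1e-12), 1e-300)      # P(S = 1 | x), clamped
        total += math.log(q) if s else math.log(1.0 - q)
    return total


def gradient(rows, theta):
    """Analytic gradient of log_lik with respect to theta."""
    c = expit(theta[2])
    g = [0.0, 0.0, 0.0]
    for x, s in rows:
        p = expit(theta[0] * x[0] + theta[1] * x[1])
        if s:
            d_eta, d_a = 1.0 - p, 1.0 - c
        else:
            denom = max(1.0 - c * p, 1e-12)
            d_eta = -c * p * (1.0 - p) / denom
            d_a = -c * (1.0 - c) * p / denom
        g[0] += d_eta * x[0]
        g[1] += d_eta * x[1]
        g[2] += d_a
    return g


def fit(rows, steps=200):
    """Gradient ascent with backtracking; a step is kept only if it does not lower log_lik."""
    theta = [0.0, 0.0, 0.0]
    ll = log_lik(rows, theta)
    step = 1.0 / len(rows)
    for _ in range(steps):
        g = gradient(rows, theta)
        while step > 1e-12:
            cand = [t + step * gi for t, gi in zip(theta, g)]
            cand_ll = log_lik(rows, cand)
            if cand_ll >= ll:
                theta, ll = cand, cand_ll
                step *= 2.0  # try a longer step next time
                break
            step *= 0.5      # shrink and retry
    return theta, ll


# Simulated data only: true beta = (-1.0, 1.5), anchor sensitivity c = 0.6.
rows = simulate(1000, (-1.0, 1.5), 0.6)
theta_hat, ll_hat = fit(rows)
print("beta_hat:", theta_hat[:2], "c_hat:", expit(theta_hat[2]))
```

In this sketch the sensitivity c is separated from the intercept only through the logistic functional form, so estimates can be unstable in small samples; a simple first-order optimizer is used here purely to keep the example dependency-free.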

Authors who are presenting talks have a * after their name.

Back to the full JSM 2018 program