All Times EDT

Abstract Details

Activity Number: 286 - Missing Data Methods
Type: Contributed
Date/Time: Wednesday, August 11, 2021 : 1:30 PM to 3:20 PM
Sponsor: Biometrics Section
Abstract #318142
Title: Positive Unlabeled Learning with Missing Data in Electronic Health Records
Author(s): Tanayott Thaweethai* and Caitlin Ann Selvaggi and Andrea Sarah Foulkes
Companies: Massachusetts General Hospital Biostatistics Center (all authors)
Keywords: missing data; electronic health records; semi-supervised learning; positive unlabeled learning
Abstract:

Analyses of electronic health records (EHR) frequently assume that the lack of recorded evidence of a condition is equivalent to its absence, which sidesteps rather than addresses the challenge of missing data. This assumption is common because the design of EHR systems makes it difficult to distinguish missing data from the absence of disease, and traditional missing data methods were developed for settings in which it is known which data are missing. Positive unlabeled learning is a type of semi-supervised learning in which the units with “labels” (i.e., patients whose disease status is known) are all “positive” (have the disease), while the remaining “unlabeled” units may or may not have the disease. Under certain assumptions, the Expectation-Maximization (EM) algorithm can be used to estimate a classifier that distinguishes positive from negative units among the unlabeled on the basis of observed data only. We adapt this framework to the EHR setting and evaluate the procedure in a registry of patients hospitalized with COVID-19 at Massachusetts General Hospital in 2020, for whom both EHR data and manually chart-reviewed data are available.
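
The abstract does not spell out the EM procedure, so the following is only a minimal sketch of one common way EM can be applied to positive unlabeled data, not the authors' method. It assumes (i) a logistic model for disease given a single covariate, (ii) that positives are labeled completely at random, and (iii) that the labeling probability c = P(labeled | positive) is known. All function names (`pu_em`, `fit_weighted_logistic`, `simulate`) and the simulated data are hypothetical and introduced purely for illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_weighted_logistic(xs, ys, ws, b0, b1, n_newton=5):
    """Weighted logistic regression (intercept + one feature) by Newton's method."""
    for _ in range(n_newton):
        g0 = g1 = h01 = 0.0
        h00 = h11 = 1e-8  # tiny ridge keeps the 2x2 Hessian invertible
        for x, y, w in zip(xs, ys, ws):
            p = sigmoid(b0 + b1 * x)
            r = w * (y - p)        # weighted score contribution
            v = w * p * (1.0 - p)  # weighted information contribution
            g0 += r
            g1 += r * x
            h00 += v
            h01 += v * x
            h11 += v * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def pu_em(x_labeled, x_unlabeled, c, n_em=60):
    """EM for PU data, assuming c = P(labeled | positive) is known.

    Returns the fitted intercept, slope, and the posterior probability
    that each unlabeled unit is a true positive.
    """
    n_l, n_u = len(x_labeled), len(x_unlabeled)
    b0, b1 = 0.0, 0.0
    q = [0.5] * n_u  # initial guess for P(positive | x, unlabeled)
    xs = x_labeled + x_unlabeled + x_unlabeled
    ys = [1] * n_l + [1] * n_u + [0] * n_u
    for _ in range(n_em):
        # M-step: each unlabeled unit enters twice, as a positive with
        # weight q and as a negative with weight 1 - q.
        ws = [1.0] * n_l + q + [1.0 - qi for qi in q]
        b0, b1 = fit_weighted_logistic(xs, ys, ws, b0, b1)
        # E-step: P(y=1 | x, unlabeled) = (1 - c) p / (1 - c p).
        q = []
        for x in x_unlabeled:
            p = sigmoid(b0 + b1 * x)
            q.append((1.0 - c) * p / (1.0 - c * p))
    return b0, b1, q


def simulate(n, prevalence, c, seed=0):
    """Hypothetical toy data: one Gaussian covariate, positives labeled with prob. c."""
    rng = random.Random(seed)
    x_lab, x_unl, y_unl = [], [], []
    for _ in range(n):
        y = 1 if rng.random() < prevalence else 0
        x = rng.gauss(2.0 if y else 0.0, 1.0)
        if y == 1 and rng.random() < c:
            x_lab.append(x)
        else:
            x_unl.append(x)
            y_unl.append(y)  # true status, hidden from the algorithm
    return x_lab, x_unl, y_unl


if __name__ == "__main__":
    x_lab, x_unl, y_unl = simulate(800, prevalence=0.4, c=0.5, seed=1)
    b0, b1, q = pu_em(x_lab, x_unl, c=0.5)
    print("intercept, slope:", b0, b1)
    print("estimated share of positives among unlabeled:", sum(q) / len(q))
```

In this sketch the true status of unlabeled units is used only to check the result, which mirrors the role that chart-reviewed data could play when evaluating a classifier fit to EHR data alone. In practice c is rarely known and would itself need to be estimated or varied in a sensitivity analysis.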


Authors who are presenting talks have a * after their name.

Back to the full JSM 2021 program