Online Program

All Times EDT

Thursday, September 23
Thu, Sep 23, 3:00 PM - 4:15 PM
Virtual
Machine Learning and Real-World Evidence Generation: Methodology, Validation, and Utility

Post-Prediction Inference for Clinical Characteristics Predicted with Machine Learning on Electronic Health Records (303541)

*Arjun Sondhi, Flatiron Health 

Real-world evidence from electronic health records (EHRs) can be used to model population-level relationships between patient characteristics and cancer outcomes. Machine learning (ML) methods are commonly applied to EHR data, such as unstructured clinical notes, to predict variables that may be expensive or infeasible to measure directly. These predicted patient characteristics are then used in downstream epidemiological or statistical models as if they were observed. In this paper, we show that naively using predicted patient characteristics from EHRs in these models produces incorrect inferential results when compared to a gold standard of expert-abstracted patient characteristics. We introduce a post-prediction inference approach that uses labeled validation data to train a calibration model capturing the relationship between predicted and observed patient characteristics. The calibration model is then applied to unlabeled data to correct estimation and inference in the analytic model. We show that calibration with this relationship model significantly improves statistical inference for patient characteristics extracted using ML, even in cases where the original ML model was trained on a different data distribution, such as patients with a different tumor type. Using a real-world cancer EHR database coupled with expert-abstracted clinical phenotypes, we demonstrate that our post-prediction inference approach improves our estimates of the effects of treatment, metastatic status, and biomarker status on overall survival.
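
The general idea of calibrating a predicted covariate on a labeled validation set before using it in an analytic model can be illustrated with a small simulation. The sketch below is not the authors' method: it swaps in a plain linear regression-calibration step and an ordinary least-squares analytic model, and every name, sample size, and noise level in it (`simulate`, `ols_fit`, the true slope of 2.0) is a hypothetical choice made only for illustration. It assumes a continuous covariate whose prediction error is independent of the outcome.

```python
import numpy as np


def ols_fit(x, y):
    """Return (intercept, slope) from a least-squares fit of y on x."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0], coef[1]


def simulate(n, rng, true_slope=2.0):
    """Simulate an observed covariate, its noisy ML prediction, and an outcome."""
    x = rng.normal(size=n)                       # stands in for the abstracted value
    x_hat = x + rng.normal(scale=1.0, size=n)    # stands in for the ML prediction
    y = 1.0 + true_slope * x + rng.normal(scale=0.5, size=n)
    return x, x_hat, y


rng = np.random.default_rng(0)

# Labeled validation data: both the observed and the predicted covariate.
x_val, x_hat_val, _ = simulate(2_000, rng)

# Unlabeled analysis data: only the prediction and the outcome are used.
_, x_hat_unl, y_unl = simulate(20_000, rng)

# Naive analysis: treat the prediction as if it were observed.
# With equal signal and error variance the slope is attenuated by about half.
_, naive_slope = ols_fit(x_hat_unl, y_unl)

# Calibration step (assumed linear): model observed ~ predicted on labeled data.
a0, a1 = ols_fit(x_hat_val, x_val)

# Apply the calibration to the unlabeled predictions, then refit the analytic model.
x_calibrated = a0 + a1 * x_hat_unl
_, corrected_slope = ols_fit(x_calibrated, y_unl)

print(f"true slope: 2.0  naive: {naive_slope:.2f}  corrected: {corrected_slope:.2f}")
```

This sketch corrects only the point estimate. Valid standard errors would also need to account for uncertainty in the calibration fit, for example by bootstrapping both steps together, which is omitted here.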