Abstract Details

Activity Number: 35
Type: Contributed
Date/Time: Sunday, July 31, 2016 : 2:00 PM to 3:50 PM
Sponsor: Section on Statistics in Epidemiology
Abstract #320778
Title: Automated Feature Selection for Prediction with Electronic Medical Records Data
Author(s): Jessica Minnier* and Sheng Yu and Katherine Liao and Tianxi Cai
Companies: Oregon Health & Science University and Tsinghua University and Brigham and Women's Hospital and Harvard
Keywords: electronic medical records ; prediction ; phenotyping ; surrogate outcome ; variable selection ; medical informatics

Using electronic medical records (EMR) for research is challenging because coding practices are imprecise and much of the information sits in free-form text fields. Natural language processing (NLP) methods can extract features from text, but selecting the informative ones is not trivial. Furthermore, imprecise billing codes can lead to mismeasurement of disease outcomes, so experts often must manually review a subset of records to obtain gold-standard phenotype labels. Models built on these labeled data have limited prediction accuracy because the number of predictors is large and the labeled sample is small. We present an automated feature selection method that uses model-based clustering and regularized regression to build a prediction model from surrogate outcomes in EMR data, such as diagnosis codes and mentions of disease in text fields. The method performs variable selection on NLP features and maintains high prediction accuracy even when labeled training data are unavailable. By reducing the need for gold-standard labels in algorithm training, it improves the efficiency of automated prediction and phenotyping.
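
The general idea of pairing model-based clustering with regularized regression can be illustrated with a small sketch. The abstract does not give the algorithm's details, so everything below is an assumption made for illustration and should not be read as the authors' method: a two-component Gaussian mixture is fit by EM to a single simulated surrogate (standing in for something like a log count of diagnosis codes), the resulting posterior probabilities serve as soft labels, and a lasso fit by coordinate descent selects among simulated NLP features. All function names, parameter values, and data are hypothetical.

```python
import math
import random


def fit_two_gaussians(s, n_iter=200):
    """EM for a 1-D two-component Gaussian mixture.

    Returns, for each observation, the posterior probability of the
    component with the larger mean (read here as "likely case").
    """
    n = len(s)
    mean_all = sum(s) / n
    sd_all = max(math.sqrt(sum((v - mean_all) ** 2 for v in s) / n), 1e-3)
    mu = [min(s), max(s)]              # low-mean and high-mean components
    sd = [sd_all, sd_all]
    w = [0.5, 0.5]
    post = [0.5] * n
    for _ in range(n_iter):
        # E-step: responsibility of the high-mean component
        for i, v in enumerate(s):
            d = [w[k] / sd[k] * math.exp(-0.5 * ((v - mu[k]) / sd[k]) ** 2)
                 for k in (0, 1)]
            total = d[0] + d[1]
            post[i] = d[1] / total if total > 0 else 0.5
        # M-step: update weight, mean, and sd of each component
        for k in (0, 1):
            r = [q if k == 1 else 1.0 - q for q in post]
            rk = max(sum(r), 1e-12)
            w[k] = rk / n
            mu[k] = sum(ri * v for ri, v in zip(r, s)) / rk
            var = sum(ri * (v - mu[k]) ** 2 for ri, v in zip(r, s)) / rk
            sd[k] = max(math.sqrt(var), 1e-3)
    return post


def standardize(col):
    """Center a column and scale it to unit (population) variance."""
    n = len(col)
    m = sum(col) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in col) / n)
    return [(v - m) / sd for v in col]


def soft_threshold(z, lam):
    return math.copysign(max(abs(z) - lam, 0.0), z)


def lasso(columns, target, lam, n_sweeps=100):
    """Coordinate-descent lasso; columns must already be standardized."""
    n = len(target)
    t_mean = sum(target) / n
    resid = [t - t_mean for t in target]
    beta = [0.0] * len(columns)
    for _ in range(n_sweeps):
        for j, x in enumerate(columns):
            # Covariance of feature j with its partial residual
            rho = sum(xi * ri for xi, ri in zip(x, resid)) / n + beta[j]
            new = soft_threshold(rho, lam)
            if new != beta[j]:
                delta = new - beta[j]
                resid = [ri - delta * xi for xi, ri in zip(x, resid)]
                beta[j] = new
    return beta


def run_demo(seed=0, n=300, p=10, lam=0.1):
    """Simulate data, derive soft labels from the surrogate, select features."""
    rng = random.Random(seed)
    # Latent disease status: used only to simulate data, never for fitting
    y = [1 if rng.random() < 0.3 else 0 for _ in range(n)]
    surrogate = [rng.gauss(3.0 if yi else 0.5, 0.5) for yi in y]
    columns = []
    for j in range(p):
        shift = 2.0 if j < 2 else 0.0   # only features 0 and 1 carry signal
        columns.append(standardize([shift * yi + rng.gauss(0, 1) for yi in y]))
    soft_labels = fit_two_gaussians(surrogate)
    beta = lasso(columns, soft_labels, lam)
    return y, soft_labels, beta


if __name__ == "__main__":
    y, soft_labels, beta = run_demo()
    print("selected features:", [j for j, b in enumerate(beta) if b != 0])
```

No gold-standard label enters the fitting steps in this sketch; the latent status `y` is kept only so the soft labels can be checked against it afterwards.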

Authors who are presenting talks have a * after their name.

Back to the full JSM 2016 program