Abstract Details

Activity Number: 30
Type: Contributed
Date/Time: Sunday, July 31, 2016 : 2:00 PM to 3:50 PM
Sponsor: ENAR
Abstract #321304
Title: Surrogate-Guided Sampling Designs for Biomedical Natural Language Processing with Rare Outcomes
Author(s): Wei Ling (Katherine) Tan* and Patrick Heagerty
Companies: University of Washington and University of Washington
Keywords: biased sampling ; Natural Language Processing ; Electronic Health Records ; study design ; machine learning ; epidemiological methods

Natural Language Processing (NLP) is increasingly used to derive patient status variables from unstructured text data. Supervised learning in NLP requires actual outcome statuses (1 = case, 0 = control) for a development (training) dataset and a validation dataset. Most applications randomly sample a modest subset of reports for expert human annotators to manually label with actual case statuses. However, for rare outcomes, random sampling may result in very few cases in the training set. Such outcome class imbalance results in limited information and a potential for overfitting when many candidate text features are considered. We propose a sampling design that enriches the training set for cases by sampling on a variable that is a weak surrogate for the outcome. Such a biased sampling design can approximate the benefits of case-control sampling even without access to the actual outcome status. Our design generates surrogate-based training sets that provide more information to improve classification performance. We discuss recommendations for selecting appropriate surrogates, and apply our methods to case status prediction in radiology reports collected from cohort studies of lower back pain.
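The enrichment idea can be illustrated with a small simulation. The following is a minimal sketch, not the authors' implementation: the prevalence, the surrogate's sensitivity and specificity, the sample sizes, and all function names (simulate_cohort, random_sample, surrogate_guided_sample, count_cases) are hypothetical values and helpers chosen only for illustration. It compares how many true cases end up in an annotation set under simple random sampling versus sampling stratified on a weak surrogate.

```python
import random


def simulate_cohort(n, prevalence, sensitivity, specificity, rng):
    """Simulate n reports as (outcome, surrogate) pairs.

    outcome   : 1 = case, 0 = control (unknown before annotation)
    surrogate : cheap, imperfect binary flag available for every report
    All parameter values are assumptions for illustration only.
    """
    cohort = []
    for _ in range(n):
        y = 1 if rng.random() < prevalence else 0
        if y == 1:
            s = 1 if rng.random() < sensitivity else 0
        else:
            s = 0 if rng.random() < specificity else 1
        cohort.append((y, s))
    return cohort


def random_sample(cohort, m, rng):
    """Simple random sample of m reports for annotation."""
    return rng.sample(cohort, m)


def surrogate_guided_sample(cohort, m, rng, frac_positive=0.5):
    """Sample m reports, drawing a fixed fraction from surrogate-positive reports.

    Only the surrogate is used to select reports; the true outcome
    is never consulted during sampling.
    """
    positives = [r for r in cohort if r[1] == 1]
    negatives = [r for r in cohort if r[1] == 0]
    n_pos = min(len(positives), round(m * frac_positive))
    n_neg = min(len(negatives), m - n_pos)
    return rng.sample(positives, n_pos) + rng.sample(negatives, n_neg)


def count_cases(sample):
    """Number of true cases that annotators would find in the sample."""
    return sum(y for y, _ in sample)


rng = random.Random(0)
# Hypothetical setting: 2% prevalence, surrogate with 70% sensitivity
# and 90% specificity, 500 reports sent for annotation.
cohort = simulate_cohort(20000, 0.02, 0.7, 0.9, rng)
srs = random_sample(cohort, 500, rng)
sgs = surrogate_guided_sample(cohort, 500, rng)

print("cases under random sampling:          ", count_cases(srs))
print("cases under surrogate-guided sampling:", count_cases(sgs))
```

Because the surrogate-guided sample is biased by design, quantities that depend on prevalence would need to account for the known sampling fractions in each surrogate stratum.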

Authors who are presenting talks have a * after their name.

Back to the full JSM 2016 program

Copyright © American Statistical Association