Abstract Details

Activity Number: 12 - Novel Statistical Methods for Analyzing Electronic Health Records and Biobank Data
Type: Invited
Date/Time: Sunday, July 29, 2018 : 2:00 PM to 3:50 PM
Sponsor: WNAR
Abstract #325464 Presentation
Title: Enabling Phenotypic Big Data with PheNorm
Author(s): Sheng Yu* and Yumeng Ma and Jessica Gronsbell and Tianrun Cai and Ashwin Ananthakrishnan and Vivian Gainer and Susanne Churchill and Peter Szolovits and Shawn Murphy and Isaac Kohane and Katherine Liao and Tianxi Cai
Companies: Tsinghua University and Tsinghua University and Harvard T.H. Chan School of Public Health and Brigham and Women's Hospital and Massachusetts General Hospital and Partners HealthCare and Harvard Medical School and Massachusetts Institute of Technology and Partners HealthCare and Harvard Medical School and Brigham and Women's Hospital and Harvard T.H. Chan School of Public Health
Keywords: phenotyping; natural language processing; electronic health records; biobank

EHR-based phenotyping infers whether a patient has a disease from the information in their electronic health records. Building a classification algorithm usually requires a human-annotated training set with disease status labels. The time required for annotation and for feature curation severely limits high-throughput phenotyping. Previous studies have successfully automated feature curation. In this talk, we present PheNorm, a phenotyping algorithm that does not require expert-labeled samples for training. PheNorm transforms predictive features, such as the number of ICD-9 codes or mentions of the target phenotype, so that their distribution resembles a normal mixture. The transformed features are then denoised and combined into a score for accurate classification. We validated the accuracy of PheNorm on four phenotypes: coronary artery disease, rheumatoid arthritis, Crohn's disease, and ulcerative colitis. The AUC of the PheNorm score reached 0.90, 0.94, 0.95, and 0.94 for the four phenotypes, respectively. These values were comparable to the accuracy of supervised algorithms trained with sample sizes of 100-300, with no statistically significant difference.
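
The general idea of scoring patients without labels can be illustrated with a minimal Python sketch. It is not the PheNorm implementation: it handles a single count feature, assumes a log(1 + count) transform, fits a two-component normal mixture with a shared standard deviation by EM, and omits the denoising and feature-combination steps entirely. The data are simulated, and all function names (fit_mixture, label_free_scores, simulate_counts, auc) are hypothetical.

```python
import math
import random
import statistics


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_high(x, w, mean_low, mean_high, sd):
    """Posterior probability that x belongs to the high-mean component."""
    p_high = w * normal_pdf(x, mean_high, sd)
    p_low = (1.0 - w) * normal_pdf(x, mean_low, sd)
    total = p_high + p_low
    return p_high / total if total > 0 else 0.5


def fit_mixture(xs, n_iter=200):
    """EM for a two-component normal mixture with a shared sd (assumption).

    No labels are used. Returns (w, mean_low, mean_high, sd), where w is
    the estimated proportion of the high-mean component.
    """
    srt = sorted(xs)
    half = len(srt) // 2
    # Crude initialisation: lower and upper halves of the sorted values.
    mean_low = statistics.fmean(srt[:half])
    mean_high = statistics.fmean(srt[half:])
    sd = max(statistics.pstdev(xs), 1e-3)
    w = 0.5
    for _ in range(n_iter):
        # E-step: responsibility of the high-mean component for each point.
        resp = [posterior_high(x, w, mean_low, mean_high, sd) for x in xs]
        n_high = sum(resp)
        n_low = len(xs) - n_high
        if n_high < 1e-9 or n_low < 1e-9:
            break
        # M-step: update means, shared sd, and mixing proportion.
        mean_high = sum(r * x for r, x in zip(resp, xs)) / n_high
        mean_low = sum((1 - r) * x for r, x in zip(resp, xs)) / n_low
        var = sum(
            r * (x - mean_high) ** 2 + (1 - r) * (x - mean_low) ** 2
            for r, x in zip(resp, xs)
        ) / len(xs)
        sd = max(math.sqrt(var), 1e-3)
        w = n_high / len(xs)
    return w, mean_low, mean_high, sd


def label_free_scores(counts):
    """Score each patient from a raw count, using no disease labels."""
    xs = [math.log1p(c) for c in counts]  # assumed transform: log(1 + count)
    w, mean_low, mean_high, sd = fit_mixture(xs)
    return [posterior_high(x, w, mean_low, mean_high, sd) for x in xs]


def auc(scores, labels):
    """AUC as the fraction of (case, control) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def simulate_counts(n_controls=300, n_cases=100, seed=0):
    """Simulated code counts (not real EHR data): cases have higher counts."""
    rng = random.Random(seed)
    counts = [rng.randint(0, 2) for _ in range(n_controls)]
    counts += [rng.randint(8, 30) for _ in range(n_cases)]
    labels = [0] * n_controls + [1] * n_cases
    return counts, labels
```

In this sketch the labels returned by simulate_counts are used only to evaluate the scores with auc, never to fit the mixture, which mirrors the label-free training setting described above. The simulated groups are deliberately well separated, so the resulting AUC says nothing about performance on real records.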

Authors who are presenting talks have a * after their name.

Back to the full JSM 2018 program