Abstract:
|
Large-scale digitization of medical records has created an unprecedented opportunity to uncover patterns in disease risk, progression, and classification. However, algorithms developed within a single site (e.g., a particular hospital or biobank) may perform worse when applied to other medical record databases. In this work, we demonstrate principles for optimizing the generalizability of classification algorithms developed from electronic medical records (EMR), and provide an example based on identifying chronic obstructive pulmonary disease, a highly heterogeneous lung disease, among patients in the Partners HealthCare Systems Biobank. Furthermore, using a combination of two feature space inputs, disease-relevant concepts contributed by medical professionals and surrogate-assisted feature extraction for high-throughput phenotyping, we port our algorithm to a secondary validation site and compare performance to demonstrate the generalizability of our approach.
|