Abstract:
|
Electronic Health Record (EHR) databases are an increasingly valuable resource for observational studies. However, misclassification of EHR-derived outcomes due to imperfect phenotyping leads to bias in association studies, as well as inflated type I error and reduced power. Manual chart review to validate outcomes is both cost-prohibitive and time-consuming, and a randomly selected validation sample may not yield sufficient cases when the disease is rare. Sampling procedures have been developed to maximize efficiency in settings where the true disease status is known. However, less work has been done in measurement-constrained settings, particularly for severely imbalanced data. Motivated by this gap, we propose a two-stage sampling algorithm to optimally guide cost-effective chart review in measurement-constrained settings. We validate our method through a simulation study and show that it is robust to differential misclassification, imbalanced data, and various covariate distributions. We then apply our sampling method to a real-world dataset with biomedical applications.
|