Abstract:
|
Electronic Health Record (EHR) databases are an increasingly valuable resource for observational studies. However, misclassification of EHR-derived phenotypes due to imperfect algorithms can lead to bias, inflated type I error, and reduced power in risk-factor association studies. Manual chart review to validate outcomes, on the other hand, is cost- and time-prohibitive, and a randomly selected validation sample may not yield sufficient cases when the disease is rare. Sampling procedures have been developed to maximize computational and statistical efficiency in settings where the true disease status is known. However, less work has been done in measurement-constrained settings, particularly for severely imbalanced data. Motivated by this gap, we propose two surrogate-assisted sampling algorithms to guide cost-effective chart review in measurement-constrained settings. We compare our proposed sampling weights with existing approaches through simulations under various covariate distributions, differential misclassification rates, and degrees of outcome imbalance. We then apply our proposed weighting schemes to a study of risk factors for second breast cancer events using a real EHR dataset.
|