Abstract:
|
Electronic health records (EHRs) and other routinely collected data are increasingly used for medical research. These data are prone to errors, often across multiple variables, and findings based on them can be misleading. Data validation is sometimes performed in subsamples of records. Validated subsets are used to describe the sensitivity/specificity of phenotype-defining algorithms and to justify their use in the larger cohort. However, the error-rate information in the validated subset is rarely combined with the unvalidated data to account for uncertainty in variables and to improve precision. In addition, the choice of records to validate is often not carefully considered. This is a two-phase sampling problem: phase 1 is the error-prone EHR data and phase 2 is the validated subsample. We demonstrate ways to incorporate the validation data into the larger dataset to improve estimation. Through simulations guided by the two-phase sampling literature, we consider different approaches for selecting validation subsamples. We then demonstrate the efficiency of different sampling schemes using a fully validated EHR dataset of HIV-positive individuals.
|