Abstract:
|
Errors in data derived from electronic health records (EHR) are common, occur in multiple variables, and tend to be correlated. For example, if the date of treatment start is incorrect in the EHR, then in a survival analysis the time from treatment start to an event will be incorrect, and many baseline covariates (e.g., lab values at the time of treatment start) are also likely to be incorrect. Such errors can substantially bias estimates. Although methods for addressing covariate measurement error are well developed, methods that simultaneously account for errors in both covariates and time-to-event outcomes have not been developed. We propose an approach that uses models built in a validation sample to multiply impute the outcome and covariate values in the unvalidated records. We implement our approach using EHR data from an HIV dataset that was fully validated, allowing comparisons of our method under various validation sample sizes with the gold standard of full data validation. Our real data example illustrates the problems of using naïve, unvalidated EHR data as well as the promise and challenges of using validation samples together with multiple imputation techniques to address data errors.
|