Abstract:
|
Electronic health records (EHRs) are increasingly used for public health surveillance. As routinely collected data, EHRs offer a faster and less expensive alternative to national surveys and registries. However, using EHR data to estimate finite population parameters, such as disease prevalence, requires great care. Data collected in EHRs constitute a convenience sample rather than a random sample. Selection depends on various factors, including demographics, health status, and health care referral patterns. Moreover, conditional on inclusion in the EHR, data completeness and quality may influence the construction of the analysis dataset. To illuminate potential sources of selection bias, we describe the process of identifying an EHR cohort based on diagnosis codes and available encounter data. Relevant to surveillance, we examine challenges in capturing race and ethnicity data in EHRs, such as missing values and misclassification, which may result in misleading inferences. Finally, we present model-based and weight-based correction methods to address non-representativeness of the EHR sample with respect to the target population for which inference is desired.
|