Abstract:
|
Modern electronic health record systems routinely collect variables of clinical interest. However, responses and predictors can be captured with error, and these errors can be correlated. A cost-effective alternative to a complete data audit is the two-phase design. In Phase I, error-prone variables are observed for all subjects, and this information is then used to select a validation subsample in Phase II. Previous corrections are limited to misclassified, binary predictors, make distributional assumptions about the error mechanisms, or rely on a validation subsample selected by simple or stratified random sampling. We propose a semiparametric approach to two-phase designs with a misclassified, binary outcome and error-prone predictors, allowing for dependent errors and arbitrary second-phase selection. We devise a computationally efficient and numerically stable EM algorithm to maximize the nonparametric likelihood function. The resulting estimators possess desirable statistical properties. We compare the performance of the proposed method with that of existing approaches through extensive simulation studies and illustrate its use in an observational HIV study.
|