Abstract:
|
Medical research increasingly depends on data collected for non-research purposes, such as electronic health records (EHR) data. EHR data and other large databases can be prone to measurement error in key exposures. Validating a subset of records is a cost-effective way to learn about the error structure, which in turn can be used to adjust analyses for this error and improve inference. We extend the mean score method for the two-stage analysis of discrete-time survival models, using the unvalidated covariates as auxiliary variables that act as surrogates for the unobserved true exposure. This method allows a two-phase sampling analysis that preserves the consistency of the regression model estimates obtained from the validated subset while gaining precision from the auxiliary data. Further, we develop optimal sampling strategies that minimize the variance of the mean score estimator for a target exposure under a fixed cost constraint. Through simulations, we evaluate the efficiency gains of the mean score estimator under optimal validation designs compared with simple random sampling. We also apply the proposed method to the Wilms tumor study.
|