Capture-Recapture Techniques to Evaluate Completeness of Administrative Health Databases for Chronic Disease Research: Effects of Misclassification Error
*Lisa M. Lix, School of Public Health, University of Saskatchewan 
Xiaojing Wu, University of Saskatchewan 

Keywords: measurement error, diagnoses, capture-recapture, assumption violations

Background: Administrative health databases are a valuable resource for studying health outcomes in chronic disease populations. Unfortunately, diagnosis codes in administrative databases may have less than perfect sensitivity, specificity, and/or positive predictive value (PPV). Capture-recapture (CR) techniques can be used to estimate the completeness of administrative databases for ascertaining chronic disease cases. However, many CR techniques rest on assumptions that may not be satisfied in practice, including the assumption that disease cases are accurately classified.

Purpose and Objectives: The purpose is to develop and compare CR techniques that adjust for misclassification error when estimating the completeness of administrative databases. The objectives are to investigate the following two-source CR techniques: (a) Chao lower-bound estimator, (b) Chao lower-bound estimator with adjustment for misclassification error, (c) multinomial logistic regression, and (d) multinomial logistic regression with adjustment for misclassification error.

Methods: Computer simulation was used to compare the techniques; simulation parameters were based on analyses of existing databases. Correlated binary data were generated for PPVs of 55%, 80%, or 100% in source 1 and a PPV of 100% in source 2. Covariates associated with heterogeneity of capture probability were generated assuming a normal or multinomial distribution. Estimates of completeness of administrative databases were adjusted for misclassification error based on prior information about PPV. Uncertainty in the estimates was modeled using the posterior distribution of PPV. Bias and mean square error (MSE) of the estimates were based on 1000 replications.

Results: None of the CR techniques was robust to dependence between the data sources and to heterogeneity of capture probabilities. When the PPV for source 1 was 100%, all techniques resulted in negatively biased estimates (i.e., underestimates of completeness), with bias ranging from -0.03 to -0.24 for Chao’s estimator and from -0.06 to -0.26 for multinomial logistic regression. Chao’s estimator was less sensitive to high correlation between data sources than multinomial logistic regression was. Simulations for other values of PPV are in progress.

Conclusions: CR techniques to minimize the effects of assumption violations on estimates of completeness should be routinely adopted when evaluating the quality of administrative health databases for chronic disease research.
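
For readers unfamiliar with the quantities reported above, the following is a minimal Python sketch of how a two-source Chao-type estimate, a completeness estimate, and simulation-based bias and MSE might be computed. It is an illustration under stated assumptions, not the study's code. The abstract does not give the estimator's formula, so the sketch assumes the two-source form n + f1^2 / (4 * f2), where f1 counts cases found in exactly one source and f2 counts cases found in both. The function names (chao_two_source, completeness, simulate) and all parameter values are hypothetical. The toy simulation also assumes independent sources, a single capture probability per source, and a PPV of 100%, which is far simpler than the correlated, covariate-dependent, misclassified data described in the Methods.

```python
import random


def chao_two_source(n11, n10, n01):
    """Assumed two-source Chao-type lower bound for the total number of cases.

    n11: cases found in both sources
    n10: cases found in source 1 only
    n01: cases found in source 2 only
    """
    if n11 == 0:
        raise ValueError("estimator is undefined when the sources do not overlap")
    observed = n11 + n10 + n01
    f1 = n10 + n01  # cases captured by exactly one source
    f2 = n11        # cases captured by both sources
    return observed + f1 ** 2 / (4 * f2)


def completeness(n_in_source, n_total_est):
    """Share of the estimated case population that a source captured."""
    return n_in_source / n_total_est


def simulate(n_cases=2000, p1=0.6, p2=0.5, reps=1000, seed=1):
    """Toy simulation (hypothetical parameters): independent sources, PPV = 100%.

    Returns the bias and MSE of the estimated completeness of source 1,
    measured against the completeness actually realised in each replication.
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(reps):
        n11 = n10 = n01 = 0
        for _ in range(n_cases):
            in1 = rng.random() < p1
            in2 = rng.random() < p2
            if in1 and in2:
                n11 += 1
            elif in1:
                n10 += 1
            elif in2:
                n01 += 1
        n_source1 = n11 + n10
        estimate = completeness(n_source1, chao_two_source(n11, n10, n01))
        truth = n_source1 / n_cases
        errors.append(estimate - truth)
    bias = sum(errors) / reps
    mse = sum(e * e for e in errors) / reps
    return bias, mse


if __name__ == "__main__":
    bias, mse = simulate(reps=200)
    print(f"bias = {bias:.4f}, MSE = {mse:.6f}")
```

A negative bias from simulate corresponds to underestimated completeness, which is the direction of bias reported in the Results. Adding dependence between sources, heterogeneous capture probabilities, or a PPV below 100% would require extending the data-generating step; none of those extensions is attempted here.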