Abstract:
|
In archival research, where data is derived from often-messy digitized text, a statistically sound approach to linkage estimation is essential. Probabilistic record linkage, the process of assigning probabilities to whether two entries correspond to the same entity, permits approximately unbiased estimation of quantities of interest while accounting for imperfect identification of matches. Bayesian approaches to record linkage are among the most accurate, but computational considerations severely limit the practical applicability of existing methods.
We introduce a new computational approach, providing both a fast algorithm for deriving point estimates that properly account for one-to-one matching and a restricted MCMC algorithm that samples from an approximate posterior distribution. These advances make it possible to perform Bayesian inference for much larger problems. We demonstrate the methods on an OCR'd dataset, the California Great Registers, a collection of 57 million voter registrations from 1900 to 1968 that constitutes the only panel dataset of party registration collected before the advent of scientific surveys.
|