Abstract:
|
We propose a Bayesian approach that performs record linkage and inference across multiple lists while simultaneously accounting for duplicate detection.
We frame the linkage problem as a clustering task in which similar records are clustered to true latent individuals. We propose a statistical model that incorporates both the linkage and the inferential processes, including the features of the records as well as the variables needed for inference. Central to our approach is the observation that the prior over the space of linkages can be written as a random partition model and can hence be used to calibrate the prior distribution on the cluster assignments of records. By jointly modeling the record linkage and the inferential process, one can account for matching uncertainty in inferential procedures based on linked data. Moreover, the joint model creates a feedback mechanism through which the information provided by the working statistical model informs the record linkage process. This feedback mechanism is essential to eliminate potential biases that can jeopardize the resulting post-linkage inference. We apply our methodology to the case of multiple regression.
|