Abstract:
|
The problems of record linkage and duplicate detection have traditionally referred to distinct but related settings: record linkage refers to linking two data sources that each contain no duplicates, and duplicate detection refers to detecting which records in a single data source are duplicates. However, it is common in practice to encounter data sources that fall somewhere between or beyond these two settings. We propose a new probabilistic model for the general problem of joint record linkage and duplicate detection that can handle such settings. In particular, we build upon previous comparison-based models and propose a prior on partitions that attempts to capture, in the context of record linkage, a generative process of partitions. We examine the performance of our model on simulated data and illustrate how it can accommodate settings outside of traditional record linkage and duplicate detection by linking data sources documenting human rights violations in El Salvador and homicides in Colombia.
|