Abstract:
|
Entity resolution (ER), also known as record linkage or de-duplication, is the process of merging noisy databases, often in the absence of a unique identifier. A major advancement in ER methodology has been the application of Bayesian generative models. Such models provide a natural framework for clustering records to unobserved (latent) entities, while providing exact uncertainty quantification and tight performance bounds. Despite these advancements, no existing model both scales to realistically sized databases and incorporates probabilistic blocking in an end-to-end approach. In this paper, we propose ``distributed blink'' or dblink, the first scalable and distributed end-to-end Bayesian model for ER, which propagates uncertainty in blocking, matching, and merging. We conduct experiments on real and synthetic data which show that dblink can achieve significant efficiency gains, in excess of 200 times, when compared to existing methodology.
|