All Times EDT

Abstract Details

Activity Number: 6 - Learning from Permuted Data and the Analysis of Linked Files
Type: Invited
Date/Time: Monday, August 3, 2020 : 10:00 AM to 11:50 AM
Sponsor: Government Statistics Section
Abstract #309270
Title: Dblink: Distributed End-to-End Bayesian Entity Resolution
Author(s): Rebecca C. Steorts* and Neil Marchant and Ben Rubinstein and Daniel Elazar and Andee Kaplan
Companies: Duke University and Melbourne and Melbourne and ABS and Colorado State University
Keywords: Entity resolution; record linkage; Bayesian methods; Markov chain Monte Carlo; distributed computing
Abstract:

Entity resolution (ER), also known as record linkage or de-duplication, is the process of merging noisy databases, often in the absence of a unique identifier. A major advance in ER methodology has been the application of Bayesian generative models. Such models provide a natural framework for clustering records to unobserved (latent) entities, while providing exact uncertainty quantification and tight performance bounds. Despite these advances, existing models do not both scale to realistically sized databases and incorporate probabilistic blocking in an end-to-end approach. In this paper, we propose "distributed blink", or dblink, the first scalable and distributed end-to-end Bayesian model for ER, which propagates uncertainty in blocking, matching, and merging. We conduct experiments on real and synthetic data which show that dblink can achieve significant efficiency gains (in excess of 200 times) when compared to existing methodology.


Authors who are presenting talks have a * after their name.

Back to the full JSM 2020 program