Abstract Details

Activity Number: 77 - Data, Linked Data, and Model-Based Analytics in Social Science
Type: Contributed
Date/Time: Sunday, July 30, 2017 : 4:00 PM to 5:50 PM
Sponsor: Social Statistics Section
Abstract #323732
Title: An Active Learning Approach to Record Linkage
Author(s): Kayla Frisoli* and Sam Ventura and Jared S Murray and Stephen Fienberg
Companies: Carnegie Mellon University and Carnegie Mellon University and Carnegie Mellon University and Carnegie Mellon University
Keywords: record linkage ; entity resolution ; classification ; active learning

Record linkage is the process of identifying records that refer to the same entity across datasets. A common unsupervised approach to classifying record-pairs as matches or non-matches is based on the Fellegi and Sunter (1969) mixture model framework. Supervised learning methods can be substantially more accurate, but they are rarely used because large, representative training sets are expensive and difficult to create. To address these issues, we study an active learning approach to record linkage. We develop an R package that lets users build their own optimized training data by iteratively prompting them to label the record-pairs that most increase the predictive power of the resulting classifier. Since some human labeling errors are likely, we assess how the proportion of incorrect user labels influences our results. We find that our approach achieves lower error rates than unsupervised approaches without requiring an unrealistically large, expensive training dataset. Our results come from the linkage of multiple sets of death records from the Syrian Civil War.
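
The labeling loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' R package, and everything in it is an assumption made for illustration: the synthetic similarity scores, the logistic regression classifier, the uncertainty-sampling rule (query the pair whose predicted match probability is closest to 0.5), the seeding heuristic, and all function names. A simulated labeler flips the true label with probability error_rate, mimicking the human labeling errors the abstract mentions.

```python
import math
import random


def predict(w, x):
    """Estimated match probability for comparison vector x (bias is w[-1])."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + w[-1]
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, epochs=300, lr=1.0):
    """Logistic regression fitted by plain batch gradient descent."""
    d = len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = predict(w, xi) - yi
            for j in range(d):
                grad[j] += err * xi[j]
            grad[d] += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


def make_pairs(n, match_rate, rng):
    """Synthetic record-pairs: three similarity scores in [0, 1] per pair.
    Matches tend to score high, non-matches low (an assumption for the demo)."""
    X, y = [], []
    for _ in range(n):
        is_match = rng.random() < match_rate
        a, b = (5, 2) if is_match else (2, 5)
        X.append([rng.betavariate(a, b) for _ in range(3)])
        y.append(1 if is_match else 0)
    return X, y


def noisy_label(true_label, error_rate, rng):
    """Simulated human labeler who errs with probability error_rate."""
    return 1 - true_label if rng.random() < error_rate else true_label


def active_learn(X, y_true, n_queries, error_rate, rng, n_seed=3):
    """Uncertainty sampling: repeatedly label the pair the model is least sure of."""
    # Seed with the least and most similar pairs so both classes are likely present.
    order = sorted(range(len(X)), key=lambda i: sum(X[i]))
    seed = order[:n_seed] + order[-n_seed:]
    labels = {i: noisy_label(y_true[i], error_rate, rng) for i in seed}
    for _ in range(n_queries):
        idx = list(labels)
        w = fit_logistic([X[i] for i in idx], [labels[i] for i in idx])
        pool = [i for i in range(len(X)) if i not in labels]
        query = min(pool, key=lambda i: abs(predict(w, X[i]) - 0.5))
        labels[query] = noisy_label(y_true[query], error_rate, rng)
    idx = list(labels)
    w = fit_logistic([X[i] for i in idx], [labels[i] for i in idx])
    return w, labels


def classification_error(w, X, y):
    """Fraction of pairs misclassified at a 0.5 probability threshold."""
    wrong = sum((predict(w, x) >= 0.5) != (yi == 1) for x, yi in zip(X, y))
    return wrong / len(X)


if __name__ == "__main__":
    rng = random.Random(0)
    X, y = make_pairs(300, 0.3, rng)
    for rate in (0.0, 0.1, 0.2):
        w, labels = active_learn(X, y, n_queries=20, error_rate=rate, rng=rng)
        print(rate, len(labels), classification_error(w, X, y))
```

Running the script prints, for each simulated labeling-error rate, the number of labels requested and the resulting error on the full synthetic pool, which is one simple way to examine how label noise affects such a classifier.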

Authors who are presenting talks have a * after their name.

Back to the full JSM 2017 program

Copyright © American Statistical Association