Name: 2019 Joint Statistical Meetings
Start: 2019-07-27T07:00:00+00:00
End: 2019-08-01
Location: Colorado Convention Center

Abstract Details

Activity Number:	285 - Probabilistic Record Linkage and Inference with Merged Data
Type:	Topic Contributed
Date/Time:	Tuesday, July 30, 2019 : 8:30 AM to 10:20 AM
Sponsor:	Section on Statistics in Epidemiology
Abstract #302950
Title:	Active Learning for Probabilistic Record Linkage
Author(s):	Ted Enamorado*
Companies:	Princeton University
Keywords:	Active Learning; Probabilistic Record Linkage; Scalability
Abstract:	Integrating information from multiple sources plays a key role in social science research. However, when a unique identifier that unambiguously links records is not available, merging datasets can be a difficult and error-prone endeavor. Probabilistic record linkage (PRL) aims to solve this problem by providing a framework in which common variables between datasets are used as potential identifiers, with the goal of producing a probabilistic estimate for the unobserved matching status across records. In this paper, I propose an active learning algorithm for PRL, which efficiently incorporates human judgment into the process and significantly improves PRL’s accuracy at the cost of manually labeling a small number of records. Using data from local politicians in Brazil, where a unique identifier is available for validation, I find that the proposed method bolsters the overall accuracy of the merging process. In addition, I examine data from a recent vote validation study conducted for the ANES, and I show that the proposed method can recover estimates that are indistinguishable from those obtained from a more extensive, expensive, and time-consuming clerical review.
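The abstract does not spell out the algorithm, so the following is only a generic sketch of the two ideas it names, not the author's method. It assumes a standard Fellegi-Sunter style model with binary field agreements that are conditionally independent, and it assumes uncertainty sampling as the active learning rule (send the pairs whose match probability is closest to 0.5 to a human labeler). The function names, the parameters `m`, `u`, and `prior`, and all numeric values are hypothetical and chosen for illustration.

```python
from math import prod


def match_probability(gamma, m, u, prior):
    """Posterior probability that a record pair is a match.

    gamma: tuple of 0/1 agreement indicators, one per linkage field
    m[k]:  assumed P(field k agrees | pair is a match)
    u[k]:  assumed P(field k agrees | pair is a non-match)
    prior: assumed share of true matches among all candidate pairs
    """
    like_match = prod(mk if g else 1 - mk for g, mk in zip(gamma, m))
    like_non = prod(uk if g else 1 - uk for g, uk in zip(gamma, u))
    numerator = prior * like_match
    return numerator / (numerator + (1 - prior) * like_non)


def most_uncertain(pairs, m, u, prior, n=1):
    """Indices of the n pairs whose posterior is closest to 0.5.

    These are the pairs a human reviewer would be asked to label next
    under plain uncertainty sampling.
    """
    scored = [
        (abs(match_probability(gamma, m, u, prior) - 0.5), i)
        for i, gamma in enumerate(pairs)
    ]
    return [i for _, i in sorted(scored)[:n]]


# Illustrative values only: two fields (say, name and birth year).
m = [0.9, 0.9]
u = [0.1, 0.1]
pairs = [(1, 1), (1, 0), (0, 0)]
to_label = most_uncertain(pairs, m, u, prior=0.5)
```

In a full active learning loop, the labels collected for the selected pairs would be used to re-estimate `m`, `u`, and `prior` before the next round of selection; that update step is omitted here.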

Authors who are presenting talks have a * after their name.

American Statistical Association