Online Program

All Times EDT

Friday, June 5
Practice and Applications
Recent Advances in Entity Resolution
Fri, Jun 5, 11:15 AM - 12:50 PM
TBD
 

Bayesian Canonicalization of Voter Registration Files (308424)

*Andee Kaplan, Colorado State University
Rebecca Steorts, Duke University 

Entity resolution (also called record linkage or de-duplication) is the process of merging noisy databases to remove duplicate entities in the absence of a unique identifier. One major challenge in using linked data is identifying canonical (or representative) records, free of duplicate information, to pass to a downstream inferential task. The canonicalization step is particularly crucial after entity resolution because a multi-stage approach allows multiple analyses to be performed on the same linked data. While this approach can be scalable, the uncertainty from each stage of the entity resolution process is not naturally propagated through the pipeline and into the downstream task. In this talk, we present five fully unsupervised methods for choosing canonical records from linked data, including a fully Bayesian approach that propagates linkage error through to the downstream inference. This multi-stage approach is illustrated and evaluated on simulated entity resolution data sets as well as voter registration data available from the North Carolina State Board of Elections (NCSBE). The NCSBE has released snapshots of its voter registration databases regularly since 2005, providing a changing view of voter registration information over time as new voters register, voters are dropped from the register, and voter information is updated. We compare the proposed canonicalization methods after performing entity resolution on five snapshots and examine the relationship between demographic information and party affiliation in the resulting canonical data sets.