JSM 2013 Home

Abstract Details

Activity Number: 572
Type: Invited
Date/Time: Wednesday, August 7, 2013 : 2:00 PM to 3:50 PM
Sponsor: Social Statistics Section
Abstract - #310441
Title: Deduplicating Text Records Using Clustering and Aggregation of Conditional Classifiers
Author(s): Samuel Ventura*+ and Rebecca Nugent
Companies: Carnegie Mellon University and Carnegie Mellon University
Keywords: record linkage ; deduplication ; classification ; clustering ; random forests
Abstract:

Traditional record linkage methods (Fellegi and Sunter, 1969) assume a one-to-one matching across two databases. Such methods cannot trivially be applied to deduplication, where each unique entity may be duplicated multiple times in a single database. Sadinle and Fienberg (2013) extend Fellegi-Sunter to L > 2 databases, but this approach may not be computationally feasible for large-scale deduplication. Several authors propose using clustering to identify unique entities. Clustering methods traditionally identify a small number of large clusters, but in deduplication, most clusters are very small or singleton. We explore the use of clustering methods to identify unique entities using calculated pairwise distances from a novel classification technique that employs conditioning on informative features of the record-pairs. We apply our methodology to the identification of unique inventors in the United States Patent and Trademark Office database and demonstrate its effectiveness over more heuristic approaches. This methodology could be applied to multiple record linkage (e.g., linking the 2010 US Census, ACS, and CCM) by adjoining all L databases and adjusting the comparison rules.
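
The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that a classifier has already produced a match probability for each record-pair, that distance is taken as one minus that probability, and that single-linkage clustering is cut at a fixed cutoff; the abstract does not specify any of these choices. The probabilities, the cutoff of 0.5, and the function name single_linkage_clusters are all hypothetical.

```python
from itertools import combinations


def single_linkage_clusters(n_records, distances, cutoff):
    """Group records whose chains of pairwise distances stay at or below cutoff.

    Cutting a single-linkage dendrogram at `cutoff` is equivalent to finding
    the connected components of the graph with an edge for every pair whose
    distance is <= cutoff, so a union-find structure is enough.
    """
    parent = list(range(n_records))

    def find(i):
        # Follow parent pointers to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (i, j), d in distances.items():
        if d <= cutoff:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[max(ri, rj)] = min(ri, rj)

    clusters = {}
    for i in range(n_records):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical match probabilities for six records. In practice these would
# come from a classifier over record-pair features; every pair not listed
# here is assumed to be an unlikely match (probability 0.05).
match_prob = {(0, 1): 0.95, (1, 2): 0.90, (3, 4): 0.85}

# Assumed distance: one minus the predicted match probability.
distances = {
    pair: 1.0 - match_prob.get(pair, 0.05)
    for pair in combinations(range(6), 2)
}

clusters = single_linkage_clusters(6, distances, cutoff=0.5)
print(clusters)  # [[0, 1, 2], [3, 4], [5]]
```

The toy result has one cluster of three records, one pair, and one singleton, which mirrors the abstract's point that deduplication yields many very small clusters rather than a few large ones. Note that records 0 and 2 end up together even though their direct distance is large; this chaining is a property of single linkage, and other linkage rules would behave differently.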


Authors who are presenting talks have a * after their name.



The views expressed here are those of the individual authors and not necessarily those of the JSM sponsors, their officers, or their staff.

Copyright © American Statistical Association.