This is the program for the 2010 Joint Statistical Meetings in Vancouver, British Columbia.

Abstract Details

Activity Number: 68
Type: Topic Contributed
Date/Time: Sunday, August 1, 2010 : 4:00 PM to 5:50 PM
Sponsor: Section on Survey Research Methods
Abstract - #307067
Title: Fast Record Linkage of Very Large Files in Support of Decennial and Administrative Records Projects
Author(s): William Winkler*+
Companies: U.S. Census Bureau
Address: 4600 Silver Hill Road, Suitland, MD, 20746
Keywords: entity resolution ; approximate string search ; classification ; record linkage ; blocking ; string comparator
Abstract:

There is an increasing need to find duplicates in very large files. This paper details the current version of the Bigmatch software (Yancey and Winkler 2007, 2009), which is fast enough to process roughly 10^17 pairs (300 million x 300 million) for the U.S. Decennial Census, as well as even larger administrative-record situations with billions of records. Through a nontrivial application of a set of blocking strategies, the software is known to find more than 97.5% of true matches with a very small error rate of less than 0.5% (Winkler 2004, 1995). It performs detailed processing on 10^12 pairs in 15 hours using 40 CPUs on an SGI Linux system. The software is 40-50 times as fast as recent parallel software (Kim and Lee 2007; Kawai, Garcia-Molina, Benjelloun, Menestrina, Whang and Gong 2006).
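
The idea behind a set of blocking strategies can be illustrated with a minimal sketch. This is not the Bigmatch implementation: the toy records, the two blocking keys (`key_zip`, `key_surname_year`), and the function `candidate_pairs` are all hypothetical and chosen only for illustration. Each pass groups records by one key and pairs records only within a group; the union over passes is the set of pairs sent on to detailed comparison, which is typically far smaller than the set of all pairs.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records: (id, surname, zip_code, birth_year).
records = [
    (1, "SMITH", "20746", 1970),
    (2, "SMYTH", "20746", 1970),
    (3, "JONES", "20746", 1985),
    (4, "SMITH", "98101", 1970),
    (5, "JONES", "98101", 1985),
]

# Hypothetical blocking keys, one per blocking pass.
def key_zip(rec):
    return rec[2]

def key_surname_year(rec):
    return (rec[1], rec[3])

def candidate_pairs(records, key_funcs):
    """Union of within-block record-id pairs over all blocking passes."""
    pairs = set()
    for key in key_funcs:
        # Group record ids by the value of this pass's blocking key.
        blocks = defaultdict(list)
        for rec in records:
            blocks[key(rec)].append(rec[0])
        # Only records sharing a key value are paired with each other.
        for ids in blocks.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs

all_pairs = len(records) * (len(records) - 1) // 2
blocked = candidate_pairs(records, [key_zip, key_surname_year])
print(f"{len(blocked)} candidate pairs instead of {all_pairs}")
```

Using more than one pass is what lets a pair such as records 1 and 4 (same surname and birth year, different ZIP code) still be compared even though the first key separates them. The choice of keys here is arbitrary; which keys achieve the match rates quoted above is described in the cited papers, not in this sketch.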


The address information is for the authors that have a + after their name.
Authors who are presenting talks have a * after their name.
