This is the program for the 2010 Joint Statistical Meetings in Vancouver, British Columbia.
Abstract Details
Activity Number: 68
Type: Topic Contributed
Date/Time: Sunday, August 1, 2010, 4:00 PM to 5:50 PM
Sponsor: Section on Survey Research Methods
Abstract - #307067
Title: Fast Record Linkage of Very Large Files in Support of Decennial and Administrative Records Projects
Author(s): William Winkler*+
Companies: U.S. Census Bureau
Address: 4600 Silver Hill Road, Suitland, MD 20746
Keywords: entity resolution; approximate string search; classification; record linkage; blocking; string comparator
Abstract:
There is an increased need to find duplicates in very large files. This paper details the current version of the Bigmatch software (Yancey and Winkler 2007, 2009), which is fast enough to process roughly 10^17 pairs (300 million x 300 million) for the U.S. Decennial Census, and even larger administrative-record situations with billions of records. The software, via a nontrivial application of a set of blocking strategies, is known to find more than 97.5% of true matches with a very small error rate of less than 0.5% (Winkler 2004, 1995). It does detailed processing on 10^12 pairs using 40 CPUs on an SGI Linux machine in 15 hours. The software is 40-50 times as fast as recent parallel software (Kim and Lee 2007; Kawai, Garcia-Molina, Benjelloun, Menestrina, Whang and Gong 2006).
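As a rough illustration of why blocking makes these pair counts tractable, the sketch below compares only records that agree on at least one blocking key and takes the union of candidate pairs over several passes. It is a minimal sketch of the general technique, not Bigmatch's implementation: the function name `candidate_pairs`, the toy records, and both blocking keys are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def candidate_pairs(file_a, file_b, blocking_keys):
    """Return index pairs (i, j) that agree on at least one blocking key."""
    pairs = set()
    for key in blocking_keys:              # one pass per blocking key
        index = defaultdict(list)
        for j, rec in enumerate(file_b):   # index file B by key value
            index[key(rec)].append(j)
        for i, rec in enumerate(file_a):   # look up each record of file A
            for j in index.get(key(rec), ()):
                pairs.add((i, j))
    return pairs

# Hypothetical toy records (not real data).
file_a = [
    {"last": "Smith", "first": "John", "zip": "20746"},
    {"last": "Jones", "first": "Mary", "zip": "20233"},
    {"last": "Brown", "first": "Ann", "zip": "98101"},
]
file_b = [
    {"last": "Smith", "first": "Jon", "zip": "20746"},
    {"last": "Jonse", "first": "Mary", "zip": "20233"},
    {"last": "Green", "first": "Ann", "zip": "10001"},
]

# Assumed blocking keys: each pass tolerates a typo in a different field.
blocking_keys = [
    lambda r: (r["zip"], r["last"][:1]),   # pass 1: ZIP + initial of surname
    lambda r: (r["first"], r["zip"]),      # pass 2: first name + ZIP
]

pairs = candidate_pairs(file_a, file_b, blocking_keys)
all_pairs = len(file_a) * len(file_b)      # size of the full cross product
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")

# Scale of the full cross product quoted in the abstract.
full_census_pairs = 300_000_000 * 300_000_000   # 9 x 10^16, roughly 10^17
```

Only the candidate pairs would then go on to detailed comparison (for example with string comparators), which is how a workload on the order of 10^17 possible pairs can shrink to something like the 10^12 pairs the abstract reports processing in detail.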
The address information is for the authors that have a + after their name.
Authors who are presenting talks have a * after their name.