Name: 2020 Joint Statistical Meetings
Start: 2020-08-02T07:00:00+00:00
End: 2020-08-06

All Times EDT

Activity Number:	214 - Contributed Poster Presentations: Quality and Productivity Section
Type:	Contributed
Date/Time:	Tuesday, August 4, 2020 : 10:00 AM to 2:00 PM
Sponsor:	Quality and Productivity Section
Abstract #313770
Title:	A Variation on the Hit-Miss Model for Data Deduplication
Author(s):	Bryan Ek* and Lucas Overbey and Emily Nystrom and Chris Williams
Companies:	NIWC Atlantic and Naval Information Warfare Center Atlantic and Naval Information Warfare Center Atlantic and Naval Information Warfare Center Atlantic
Keywords:	deduplication; cleaning; aggregation; hit-miss; correlation; string distance
Abstract:	The use of data from multiple information sources is common in many real-world applications. As a result, both the aggregation of multi-sourced data and the handling of aggregate data have been discussed in the statistical literature for decades. This report describes a novel approach to deduplicating records in data compiled from multiple sources. Adapted from Norén's hit-miss model [Data Mining and Knowledge Discovery, 2007], our approach makes two main contributions. First, we extend the binary match/mismatch treatment of strings by incorporating a string distance, which allows a more granular quantification of string similarity. Second, we improve the computational speed of the algorithm by implementing an alternative method to account for correlations between fields.
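
Illustrative note: the abstract does not say which string distance the authors use or how it enters the hit-miss likelihood. The minimal Python sketch below only contrasts a binary match/mismatch comparison with a graded similarity score, assuming (as a stand-in, not the authors' method) a Levenshtein edit distance normalized by the longer string's length. The function names `binary_match` and `graded_similarity` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def binary_match(a: str, b: str) -> int:
    """Binary treatment: 1 for an exact match, 0 for any mismatch."""
    return 1 if a == b else 0


def graded_similarity(a: str, b: str) -> float:
    """Assumed graded treatment: 1.0 for identical strings, 0.0 for maximally different."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    # A one-character spelling variant is a plain mismatch under the binary
    # treatment but scores close to 1 under the graded one.
    print(binary_match("Jonathan", "Jonathon"))       # 0
    print(graded_similarity("Jonathan", "Jonathon"))  # 0.875
```

Under the binary treatment a single-character variant is indistinguishable from a completely different string; a graded score keeps that difference visible, which is the kind of granularity the abstract refers to.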

Authors who are presenting talks have a * after their name.
