
All Times EDT

Abstract Details

Activity Number: 214 - Contributed Poster Presentations: Quality and Productivity Section
Type: Contributed
Date/Time: Tuesday, August 4, 2020 : 10:00 AM to 2:00 PM
Sponsor: Quality and Productivity Section
Abstract #313770
Title: A Variation on the Hit-Miss Model for Data Deduplication
Author(s): Bryan Ek* and Lucas Overbey and Emily Nystrom and Chris Williams
Companies: NIWC Atlantic and Naval Information Warfare Center Atlantic and Naval Information Warfare Center Atlantic and Naval Information Warfare Center Atlantic
Keywords: deduplication; cleaning; aggregation; hit-miss; correlation; string distance
Abstract:

Data drawn from multiple information sources are common in many real-world applications. As a result, both the aggregation of multi-sourced data and the handling of aggregated data have been discussed in the statistical literature for decades. This report describes a novel approach to deduplicating records in aggregated data. Adapted from Noren's hit-miss model [Data Mining and Knowledge Discovery, 2007], our approach makes two main contributions. First, we extend the binary match/mismatch treatment of strings by incorporating a string distance, which allows a more granular quantification of string similarity. Second, we improve the computational speed of the algorithm by implementing an alternative method of accounting for correlations between fields.
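
The first contribution, replacing a binary match/mismatch with a graded string comparison, can be illustrated with a minimal sketch. The abstract does not specify the distance or the scoring formula, so everything here is an assumption made for illustration: the choice of Levenshtein distance, the normalization by the longer string's length, the linear interpolation between weights, and the hypothetical names similarity, binary_weight, graded_weight, hit_weight, and miss_weight. It is not the authors' method.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the standard two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1]; 1.0 means identical (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def binary_weight(a: str, b: str, hit_weight: float, miss_weight: float) -> float:
    """Binary treatment: full hit weight on exact match, miss weight otherwise."""
    return hit_weight if a == b else miss_weight


def graded_weight(a: str, b: str, hit_weight: float, miss_weight: float) -> float:
    """Assumed graded treatment: interpolate linearly between miss and hit weights."""
    return miss_weight + similarity(a, b) * (hit_weight - miss_weight)
```

Under these assumptions, a near match such as "smith" versus "smyth" receives the full miss weight from binary_weight but a score between the two weights from graded_weight, which is the kind of added granularity the abstract describes.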


Authors who are presenting talks have a * after their name.

Back to the full JSM 2020 program