Abstract:
|
The use of data from multiple information sources is common in many real-world applications. As a result, both the aggregation of multi-sourced data and the handling of aggregate data have been discussed in the statistical literature for decades. This report describes a novel approach to the deduplication of records in aggregated, compiled data. Our approach adapts Noren's hit-miss model [Data Mining and Knowledge Discovery, 2007] and makes two main contributions. First, we extend the binary match/mismatch treatment of strings by incorporating a string distance, which allows a more granular quantification of string similarity. Second, we improve the computational speed of the algorithm by implementing an alternative method of accounting for correlations between fields.
|