Abstract:
|
Record linkage is the process of identifying records that refer to the same entity across datasets. A common unsupervised approach to classifying record-pairs as matches or non-matches is based on the Fellegi and Sunter (1969) mixture model framework. Supervised learning methods can achieve significant gains in accuracy, but they are rarely used because large, representative training data is expensive and difficult to create. To address these issues, we study an active learning approach to record linkage. We develop an R package that lets users build their own optimized training data by iteratively prompting them to label record-pairs that increase the predictive power of the resulting classifier. Since some human labeling errors are likely, we assess how the proportion of incorrect user labels influences our results. We find that our approach achieves lower error rates than unsupervised approaches without requiring an unrealistically large, expensive training dataset. Our results come from the linkage of multiple sets of death records from the Syrian Civil War.
|