Abstract:
|
The accurate and cost-effective estimation of linkage errors remains a major challenge for the automated production and use of linked data. However, this exercise is worthwhile only if the linked data are fit for use. A new model is proposed to estimate the errors without clerical reviews, training data, or conditional independence assumptions, under regularity conditions that guarantee the fitness for use of the linked data. It is based on the number of records adjacent to a given record, when linking files that have few duplicate records and nearly complete coverage of the target population. Additional benefits include the estimation of false negatives due to blocking criteria, as well as record-level measures of error, two challenges for previous models.
|