Abstract:
|
The Fellegi-Sunter record linkage paradigm, as originally conceived, assumes that for each comparison field, such as first name, year of birth, or state of residence, agreement between the two records in a pair is strictly binary: the field either agrees completely or it does not. For string comparisons, particularly on name fields, intuition tells us that two versions of a name that are very similar but not identical (e.g. ‘Resnick’ compared to ‘Reznik’) are more indicative of a match than of a non-match. Several string comparators, such as Jaro-Winkler similarity and Levenshtein distance, quantify the level of agreement on a full range of values between complete agreement and complete non-agreement. One way of using such a metric is to establish a cutoff above which the fields are treated as essentially in agreement, but this requires a method for determining the cutoff. We instead seek a way to assess several gradations of agreement for string comparisons and to assign agreement and non-agreement weights corresponding to the observed gradation. In this paper, we describe such a method, which maintains and expands upon the Fellegi-Sunter approach.
|
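To make the idea of graded agreement concrete, the following is a minimal Python sketch, not the method developed in the paper. It assumes a normalized Levenshtein similarity as the comparator, three hypothetical cutoffs (1.0, 0.85, 0.70) that define four agreement levels, and purely illustrative m- and u-probabilities (the probability of observing each level among matches and among non-matches, respectively). The names `levenshtein`, `similarity`, `agreement_level`, and `field_weight` are likewise assumptions made for illustration. Each level receives the usual Fellegi-Sunter weight, log2(m/u).

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; 1.0 means identical strings."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


# Hypothetical agreement levels with illustrative (m, u) probabilities.
# These numbers are assumptions for the sketch, not estimates from any data.
LEVELS = {
    "exact":    (0.80, 0.01),
    "high":     (0.10, 0.01),
    "moderate": (0.06, 0.03),
    "disagree": (0.04, 0.95),
}


def agreement_level(a: str, b: str) -> str:
    """Map a similarity score to one of the assumed agreement levels."""
    s = similarity(a, b)
    if s == 1.0:
        return "exact"
    if s >= 0.85:
        return "high"
    if s >= 0.70:
        return "moderate"
    return "disagree"


def field_weight(a: str, b: str) -> float:
    """Fellegi-Sunter style weight log2(m/u) for the observed level."""
    m, u = LEVELS[agreement_level(a, b)]
    return math.log2(m / u)


if __name__ == "__main__":
    for pair in [("Resnick", "Resnick"), ("Resnick", "Reznik"), ("Resnick", "Smith")]:
        print(pair, agreement_level(*pair), round(field_weight(*pair), 2))
```

Under these assumed cutoffs, ‘Resnick’ and ‘Reznik’ differ by two edits and fall in the intermediate level, so the pair receives a positive weight that is smaller than the weight for exact agreement. A binary scheme would instead have to treat the pair as either full agreement or full disagreement.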