Abstract:
|
Record linkage is ultimately a decision problem: declaring which compared record pairs are matches. Here we view record linkage in terms of which decisions are possible as software packages, algorithms and parameter settings are varied. We report the results of an experiment on two publicly available datasets for which "ground truth" is known, using six freely available packages. Our analyses focus on the resulting weights for 77,951 compared record pairs. Depending on parameter settings (for example, use of EM algorithms or the string matching method), the number of distinct weights can vary by orders of magnitude. Therefore, the number of distinct sets of matches as a function of the threshold (that is, the space of possible decisions) also varies. In some instances, there is no threshold that correctly reproduces ground truth. In others, "correct" thresholds exist but are difficult to identify without knowledge of ground truth. The available decisions differ across software packages, even though the algorithms are purported to be identical; over parameter settings; and over often opaque implementation details such as treatment of missing values. We propose the use of ensemble decision rules.
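
The link between distinct weights and available decisions can be illustrated with a minimal sketch. It assumes a pair is declared a match when its weight is at least the threshold; that rule, the function names and the toy weights are illustrative assumptions and are not taken from any of the packages studied.

```python
from typing import Optional, Sequence


def count_threshold_decisions(weights: Sequence[float]) -> int:
    """Number of distinct match sets reachable by the rule `weight >= t`."""
    # Each distinct weight value is a possible cut point; one further
    # decision (declare no pair a match) uses a threshold above the maximum.
    return len(set(weights)) + 1


def separating_threshold(
    weights: Sequence[float], is_match: Sequence[bool]
) -> Optional[float]:
    """Return a threshold that reproduces the labels under `weight >= t`, or None."""
    match_w = [w for w, m in zip(weights, is_match) if m]
    nonmatch_w = [w for w, m in zip(weights, is_match) if not m]
    if not match_w:
        # No true matches: a threshold above every finite weight works.
        return float("inf")
    if nonmatch_w and max(nonmatch_w) >= min(match_w):
        # Some non-match scores at least as high as some match,
        # so no single threshold can separate the two groups.
        return None
    return min(match_w)


# Toy weights for five compared pairs (hypothetical values).
weights = [5.2, 5.2, 1.0, -3.1, 1.0]

n_decisions = count_threshold_decisions(weights)  # 3 distinct weights -> 4 decisions
separable = separating_threshold(weights, [True, True, False, False, False])
tied = separating_threshold(weights, [True, False, True, False, False])
```

Under these assumptions, a package that produces only a few distinct weights offers only a few possible match sets, and a true match that shares a weight with a true non-match (the `tied` case, which returns `None`) rules out any threshold that reproduces ground truth.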
|