Abstract:
|
Record linkage is a method for combining multiple datasets by matching records of unique entities (often individual survey respondents) that appear in more than one of them. In other words, it identifies records that are shared across the datasets and connects them. In recent years, many ready-to-use record linkage software packages have entered the marketplace. One notable example is the R package fastLink (Enamorado et al. 2019), which won the Statistical Software Award from the Society for Political Methodology. Here at NORC we have developed our own proprietary record linkage software, NorcLink. Like fastLink, NorcLink uses the Fellegi-Sunter method (1969); however, NorcLink also allows datasets to be linked hierarchically. We will compare the two software packages in terms of precision and recall. Among the areas we will explore are how the two packages handle string pre-processing, blocking, and string distance (with a focus on the Jaro-Winkler distance), as well as their performance and processing time. We will also explore how the two packages handle datasets of different sizes, missing data, and different types of variables, including string and numeric variables.
|