Abstract:
|
Many data sets, such as surveys, are publicly available for analysis. Linking such public data sources to internal or private data sets enables richer analysis. When the two files share no common identifier, linking typically relies on matching a set of variables common to both files. However, data quality problems, such as inaccurate field values or missing data, can hinder the linking process. We present a Bayesian file linking methodology that links records using continuous matching variables (MVs) in situations where we do not expect the values of these MVs to agree exactly across matched pairs. The method involves a linking model for the distance between the MVs of records in one file and the MVs of their linked records in the second file. This model is conditional on a vector indicating the links. We specify a mixture model for the distance component of the linking model, as this latent structure allows the distance between MVs in linked pairs to vary across types of linked pairs. Finally, we specify a model for the linking vector. We use the approach to link public survey information to data from the U.S. Census of Manufactures.
|
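The role of the mixture model for distances can be illustrated with a minimal sketch. It is not the paper's model: the exponential component densities, the two link types, and the names `exp_pdf`, `component_posteriors`, `weights`, and `rates` are all assumptions chosen for illustration. Given the distance between the MVs of a linked pair, the sketch computes the posterior probability that the pair belongs to each latent type.

```python
import math

def exp_pdf(d, rate):
    """Exponential density at distance d >= 0 (assumed component form)."""
    return rate * math.exp(-rate * d)

def component_posteriors(d, weights, rates):
    """Posterior probability of each latent link type given distance d."""
    # Joint weight of each component: mixing weight times component density.
    joint = [w * exp_pdf(d, r) for w, r in zip(weights, rates)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical two-type setup (values are illustrative, not estimates):
# type 0 = linked pairs whose MVs nearly agree, type 1 = noisier linked pairs.
weights = [0.7, 0.3]
rates = [10.0, 1.0]

post_small = component_posteriors(0.05, weights, rates)  # small distance
post_large = component_posteriors(2.0, weights, rates)   # large distance
```

Under these assumed values, a small distance is attributed mostly to the near-agreement type and a large distance mostly to the noisy type, which is how a latent structure lets distances vary across types of linked pairs.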