Abstract:
|
Probabilistic record linkage, i.e., the identification of matching records in multiple files, can be a challenging and error-prone task. Linkage error can considerably affect subsequent analysis based on the resulting linked file. Several recent papers have studied post-linkage linear regression analysis, with the response variable Y in one file and the covariates X in a second file, from the perspective of the “broken sample problem” and “permuted data”. In this work, we extend this line of research to generalized linear models under the assumption of a small to moderate number of mismatches. An approach based on dummy variables and 1-norm penalization is proposed, and non-asymptotic error bounds for estimating the regression parameters are derived. For selected models, we also state conditions under which the underlying permutation can be recovered, i.e., under which the correct correspondence between X and Y can be restored.
|