Abstract:
|
In the analysis of data sets consisting of pairs (X_i,Y_i)^n, a tacit assumption is that each pair corresponds to the same observation unit. If, however, such pairs are obtained via record linkage of two files, this assumption is often violated due to mismatch error resulting from the absence of a common unique identifier. Recently, there has been a surge of interest in the problem under the term ”shuffled data” in which the underlying correct pairing of (X_i,Y_i)-pairs is represented via an unknown index permutation. Explicit modeling of this quantity tends to be associated with substantial overfitting. For the purpose of regularization, we consider a flexible exponential family prior on the permutation group that can be used to integrate various structures such as sparse and locally constrained shuffling. This prior is shown to be conjugate if the likelihood of the (X,Y)-pairs has an exponential family form, as in the case of generalized linear models. Inference is based on the EM algorithm in which the intractable E-step is approximated by the Fisher-Yates algorithm. Comparisons on synthetic and real data show that the proposed approach compares favorably to competing methods.
|