Abstract:
|
There is increasing demand for privacy-preserving methodologies in modern statistics and machine learning. Differential privacy, a mathematical notion from computer science, is an increasingly popular tool that offers robust privacy guarantees. Recent work focuses primarily on developing differentially private versions of standard statistical inference procedures, yet nontrivial upstream pre-processing is typically not incorporated. An important example is record linkage performed prior to downstream modeling. Record linkage refers to the statistical task of linking two or more datasets on the same group of entities without a unique identifier. We adopt a secondary perspective on record linkage, meaning that the linked data are already available and we have access to information about linkage quality. We present a new method for differentially private linear regression when the explanatory and response variables are linked with errors. In particular, we propose a noisy gradient method and provide finite-sample risk bounds on the corresponding estimation error, which allow us to understand and adjust the relative contributions of linkage error, estimation error, and the cost of privacy.
|