Keywords: Statistical disclosure limitation, Distributed databases, Secure multi-party computation, Lasso regression, Data sharing
Integrating multiple databases that are distributed among different data owners can be beneficial in numerous contexts of biomedical research. Yet the actual sharing of data is often impeded by concerns about data confidentiality. Such situations require tools that can produce correct results while preserving data privacy. In recent years, many "secure" protocols have been proposed to solve specific statistical problems such as linear regression and classification. However, factors such as the complexity of these protocols, the inability to assess model fit, and the lack of a platform to handle the necessary data exchange have prevented them from being used in real-life situations. We present a practical approach to performing statistical analyses securely on data held separately by multiple parties, without actually combining the data. The main focus is on protocols for the vertically partitioned database setting and for generalized linear models. Extensions to model-selection algorithms such as the Lasso will also be introduced. Possible disclosure risks will be discussed so that users can decide whether the approach is "secure" enough for their needs. We are cu