Abstract:
|
Data sets subject to statistical disclosure limitation (SDL) often have many variables of different types that need to be altered to reduce disclosure risk. To produce a public data set with high utility, the data protector needs to account for the relationships between the variables. Thus, ideally, SDL methods should not be univariate, treating each variable independently, but multivariate, handling many variables at the same time. However, if a data set has hundreds of variables, as many government survey data sets do, developing and implementing a multivariate approach to disclosure limitation becomes difficult. In this paper we propose a pre-masking data processing step that consists of a special type of clustering of variables in high-dimensional data sets, so that different groups of variables can be masked independently with minimal loss of data utility. Reducing the number of variables that have to be masked together reduces the complexity of SDL. The experimental results presented in the paper indicate that our clustering approach preserves data utility well.
|