Abstract Details

Activity Number: 591 - Synthetic Data and Data Disclosure
Type: Contributed
Date/Time: Wednesday, August 1, 2018 : 2:00 PM to 3:50 PM
Sponsor: Government Statistics Section
Abstract #328336
Title: Pre-Masking Procedure for Grouping Variables in Multivariate Data Sets
Author(s): Anna Oganian*
Companies: National Center for Health Statistics
Keywords: Statistical disclosure limitation; Clustering; Dimensionality reduction
Abstract:

Data sets subject to statistical disclosure limitation (SDL) often have many variables of different types that need to be altered to reduce disclosure risk. To produce a public data set with high utility, the data protector needs to account for the relationships between the variables. Ideally, then, SDL methods should not be univariate, treating each variable independently, but multivariate, handling many variables at the same time. However, if a data set has hundreds of variables, as many government survey data sets do, developing and implementing a multivariate approach to disclosure limitation becomes difficult. In this paper we propose a pre-masking data processing step that consists of a special type of clustering of variables in high-dimensional data sets, so that different groups of variables can be masked independently with minimal loss of data utility. Reducing the number of variables that have to be masked together reduces the complexity of SDL. The experimental results presented in the paper show good utility properties of our clustering approach.
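
The abstract does not specify the clustering procedure, so the following is only an illustrative sketch of the general idea of grouping variables before masking, not the author's method. It assumes a simple stand-in technique: link two variables when their absolute Pearson correlation reaches a threshold, then take the connected components as groups, so that any two variables in different groups are only weakly correlated. The function names, the threshold value, and the toy data are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    denom = math.sqrt(sxx * syy)
    return sxy / denom if denom > 0 else 0.0


def group_variables(columns, threshold=0.3):
    """Group variable names so that every cross-group pair has |r| < threshold.

    columns: dict mapping variable name -> list of numeric values.
    Assumption (not from the abstract): groups are the connected components
    of the graph whose edges join pairs with |Pearson r| >= threshold.
    """
    names = list(columns)
    parent = {name: name for name in names}  # union-find forest

    def find(name):
        # Follow parent links to the root, shortening the path on the way.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(columns[a], columns[b])) >= threshold:
                parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical toy data: two pairs of related variables that are
# uncorrelated with each other by construction.
z1 = [1, 2, 3, 4, 5, 6, 7, 8]
z2 = [1, -1, -1, 1, 1, -1, -1, 1]
toy = {
    "income": [float(v) for v in z1],
    "spending": [2.0 * v + 1.0 for v in z1],
    "age": [float(v) for v in z2],
    "bmi": [-3.0 * v for v in z2],
}

print(group_variables(toy))
```

On this toy input the sketch separates the variables into two groups, {age, bmi} and {income, spending}, each of which could then be passed to a multivariate masking method on its own. Pearson correlation only captures linear relationships between numeric variables; data sets with mixed variable types, as described in the abstract, would need a different association measure.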


Authors who are presenting talks have a * after their name.

Back to the full JSM 2018 program