Keywords: classification, CART, privacy, ACS, synthetic data
Faced with the increasing public availability of large demographic databases and distributed computing, data-releasing agencies are turning to mathematically formal privacy definitions to protect respondent identities and attributes. While paradigms such as differential privacy provide quantifiable privacy guarantees, their implementations can prove computationally intensive and difficult to apply. This is especially true for demographic surveys that collect detailed respondent attributes. For such surveys, other privacy methods can provide protection against specific attacks while maintaining survey accuracy. We detail the use of a simple machine-learning algorithm, classification trees, in creating synthetic data for the protection of categorical attributes in the American Community Survey. We discuss the difficulties in applying these algorithms to survey data and contrast them with the difficulties in using formal privacy techniques.