All Times EDT

Abstract Details

Activity Number: 465 - Privacy, Confidentiality, and Disclosure Limitation
Type: Contributed
Date/Time: Thursday, August 6, 2020 : 10:00 AM to 2:00 PM
Sponsor: Government Statistics Section
Abstract #313556
Title: A Synthetic Supplemental Public Use File of Low-Income Information Return Data: Methodology, Utility, and Privacy Implications
Author(s): Aaron Williams* and Claire Bowen and Len Burman and Surachai Khitatrakun and Kyle Ueyama
Companies: and Urban Institute and Urban Institute and Urban Institute and Urban Institute
Keywords: Synthetic data; statistical disclosure control; privacy; confidentiality; CART
Abstract:

Government agencies hold data that could be invaluable for evaluating public policy but that often cannot be released publicly because of disclosure concerns. For instance, the IRS has a database of information returns that could answer many questions about Americans with incomes below the income tax filing threshold. These data represent many of the lowest-income and hardest-to-understand Americans. This paper analyzes the application of synthetic data generation for statistical disclosure control to IRS microdata from information returns. We use sequential Classification and Regression Trees (CART) and kernel density smoothing to create a new synthetic microdata file of people who did not file a tax return in 2012. To avoid disclosure, we add more noise in the sparser parts of distributions than in the denser parts. This synthetic data file represents previously unreleased information useful for tax modeling. We test and evaluate the tradeoffs between data utility and disclosure risk for different parameterizations using a variety of validation metrics. The resulting synthetic data set has high utility, particularly for summary statistics and microsimulation, and low disclosure risk.
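
The synthesis approach described above can be illustrated with a minimal sketch. It is not the authors' implementation: all function names and parameters (`fit_tree`, `local_bandwidth`, `synthesize`, `min_leaf`, `k`, `scale`) are hypothetical, the tree handles a single predictor, and a k-nearest-neighbor distance is assumed as a simple stand-in for kernel density smoothing so that values in sparse regions receive more noise than values in dense regions.

```python
import random

def sum_sq(vals):
    """Sum of squared deviations from the mean."""
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals)

def fit_tree(rows, min_leaf=5):
    """Fit a one-predictor regression tree on (x, y) pairs.
    Leaves keep the observed y values, which act as donors."""
    ys = [y for _, y in rows]
    best = None
    if len(rows) >= 2 * min_leaf:
        srt = sorted(rows)
        for i in range(min_leaf, len(srt) - min_leaf + 1):
            if srt[i - 1][0] == srt[i][0]:
                continue  # cannot split between equal x values
            sse = sum_sq([y for _, y in srt[:i]]) + sum_sq([y for _, y in srt[i:]])
            if best is None or sse < best[0]:
                best = (sse, (srt[i - 1][0] + srt[i][0]) / 2, srt[:i], srt[i:])
    if best is None or best[0] >= sum_sq(ys):
        return {"donors": ys}  # no split reduces the error: make a leaf
    _, cut, left, right = best
    return {"cut": cut,
            "left": fit_tree(left, min_leaf),
            "right": fit_tree(right, min_leaf)}

def find_leaf(tree, x):
    """Return the donor values of the leaf that x falls into."""
    while "cut" in tree:
        tree = tree["left"] if x <= tree["cut"] else tree["right"]
    return tree["donors"]

def local_bandwidth(value, donors, k=3):
    """Distance from value to its k-th nearest donor.
    Larger where donors are sparse (assumed proxy for low density)."""
    dists = sorted(abs(value - d) for d in donors)
    return dists[min(k, len(dists) - 1)]

def synthesize(conf, n, scale=0.5, seed=None):
    """Sequentially synthesize n (x, y) records from confidential pairs."""
    rng = random.Random(seed)
    xs = [x for x, _ in conf]
    tree = fit_tree(conf)  # models y given x on the confidential data
    out = []
    for _ in range(n):
        # Step 1: draw x from its marginal, then perturb it.
        x0 = rng.choice(xs)
        x = rng.gauss(x0, scale * local_bandwidth(x0, xs))
        # Step 2: draw y from the leaf matching the synthetic x, then perturb it.
        donors = find_leaf(tree, x)
        y0 = rng.choice(donors)
        y = rng.gauss(y0, scale * local_bandwidth(y0, donors))
        out.append((x, y))
    return out
```

With more than two variables, the same pattern would repeat: each variable is modeled conditional on the variables synthesized before it. Note that in this sketch a bandwidth of zero (many tied donor values) would release a donor value unperturbed, so a real application would need a floor on the noise scale.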


Authors who are presenting talks have a * after their name.

Back to the full JSM 2020 program