Activity Number: 465 - Privacy, Confidentiality, and Disclosure Limitation
Type: Contributed
Date/Time: Thursday, August 6, 2020 : 10:00 AM to 2:00 PM
Sponsor: Government Statistics Section
Abstract #313556
Title: A Synthetic Supplemental Public Use File of Low-Income Information Return Data: Methodology, Utility, and Privacy Implications
Author(s): Aaron Williams* and Claire Bowen and Len Burman and Surachai Khitatrakun and Kyle Ueyama
Companies: and Urban Institute and Urban Institute and Urban Institute and Urban Institute
Keywords: Synthetic data; statistical disclosure control; privacy; confidentiality; CART
Abstract:
Government agencies hold data that could be invaluable for evaluating public policy but that often cannot be released publicly because of disclosure concerns. For instance, the IRS has a database of information returns that could answer many questions about Americans with incomes below the income tax filing threshold. These data cover many of the lowest-income and hardest-to-understand Americans. This paper analyzes the application of synthetic data generation for statistical disclosure control to IRS microlevel data from information returns. We use sequential Classification and Regression Trees (CART) and kernel density smoothing to create a new synthetic microdata file of people who did not file a tax return in 2012. To limit disclosure risk, we added more noise in sparser parts of the distributions than in denser parts. This synthetic data file contains previously unreleased information useful for tax modeling. We tested and evaluated the tradeoffs between data utility and disclosure risk across different parameterizations using a variety of validation metrics. The resulting synthetic data set has high utility, particularly for summary statistics and microsimulation, and low disclosure risk.
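As a rough illustration of the general idea, and not the authors' implementation, the sketch below synthesizes two variables in sequence. The first is drawn from a smoothed marginal; the second is drawn from the leaf of a small regression tree fit on the first. Noise is Gaussian with a bandwidth equal to the distance to the k-th nearest neighbor, which is one assumed way of adding more noise where data are sparse. All names (`build_tree`, `smooth_draw`, `synthesize`), the parameter values, and the toy "wages"/"benefits" data are hypothetical.

```python
import random

def build_tree(rows, min_leaf=20):
    """Grow a one-predictor regression tree on (x, y) pairs by minimizing SSE."""
    def sse(ys):
        m = sum(ys) / len(ys)
        return sum((y - m) ** 2 for y in ys)

    rows = sorted(rows)
    best = None
    for i in range(min_leaf, len(rows) - min_leaf + 1):
        if rows[i - 1][0] == rows[i][0]:
            continue  # cannot split between equal x values
        cost = sse([y for _, y in rows[:i]]) + sse([y for _, y in rows[i:]])
        if best is None or cost < best[0]:
            best = (cost, i)
    if best is None:  # too few rows to split further: store the leaf's y values
        return {"leaf": [y for _, y in rows]}
    i = best[1]
    cut = (rows[i - 1][0] + rows[i][0]) / 2
    return {"cut": cut,
            "left": build_tree(rows[:i], min_leaf),
            "right": build_tree(rows[i:], min_leaf)}

def find_leaf(tree, x):
    """Follow the splits down to the leaf that x falls in."""
    while "leaf" not in tree:
        tree = tree["left"] if x <= tree["cut"] else tree["right"]
    return tree["leaf"]

def smooth_draw(values, rng, k=5):
    """Pick a donor value and add Gaussian noise scaled by local sparsity."""
    donor = rng.choice(values)
    dists = sorted(abs(v - donor) for v in values)
    # Distance to the k-th nearest neighbor: large where the data are sparse.
    # Assumption: this stands in for an adaptive kernel bandwidth.
    bandwidth = dists[min(k, len(dists) - 1)]
    return rng.gauss(donor, bandwidth)

def synthesize(records, n, seed=0):
    """Sequentially synthesize n (x, y) records from confidential (x, y) pairs."""
    rng = random.Random(seed)
    xs = [x for x, _ in records]
    tree = build_tree(records)
    synthetic = []
    for _ in range(n):
        x_syn = smooth_draw(xs, rng)                      # first variable: smoothed marginal
        y_syn = smooth_draw(find_leaf(tree, x_syn), rng)  # second: conditional on the first
        synthetic.append((x_syn, y_syn))
    return synthetic

# Toy stand-in for confidential data (invented for this example only).
toy_rng = random.Random(1)
confidential = []
for _ in range(300):
    wages = toy_rng.lognormvariate(9, 1)
    benefits = 0.1 * wages + toy_rng.gauss(0, 200)
    confidential.append((wages, benefits))

synthetic = synthesize(confidential, n=300)
```

The sketch does not enforce bounds (synthetic values can be negative) and uses a single predictor; a realistic synthesizer would condition each variable on all previously synthesized ones and would be checked against utility and disclosure-risk metrics.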
Authors who are presenting talks have a * after their name.
Back to the full JSM 2020 program