532 – Can Statistics Inform Decisions in Social, Economic, and Political Event?
Quality and Validity Testing of Sparse Form Data using Gaussian Mixture Models
Anne Parker
US Internal Revenue Service
Danielle Gewurz
Deloitte Consulting
William J. J. Roberts
Deloitte Consulting
Recommendation systems are a family of unsupervised machine learning approaches widely used in commercial industry that can estimate either non-existent or missing values as well as identify outliers and anomalies. This approach is both trained on and applied to the very same set of data. This obviates the need for the degree of curated training data necessary for predictive modeling in a supervised context. In this paper, we describe the application of one such approach in which a collaborative filtering model is trained to identify population parameters using sparse data to identify anomalous values among a populations of millions of observations across a range of data fields. Estimates are produced for each field based on the entire population for that field as well as the other fields associated with that observation. Anomalies in each observation are detected by estimating the expected value of each data field and subsequently comparing those estimates against observed values. After preliminary testing, the performance of the collaborative filtering model improves upon current methods of identifying anomalies within the IRS.