All Times EDT

Abstract Details

Activity Number: 548 - Using Artificial Intelligence and Advanced Statistical Methods to Improve Official Statistics
Type: Topic Contributed
Date/Time: Thursday, August 6, 2020 : 1:00 PM to 2:50 PM
Sponsor: Government Statistics Section
Abstract #311138
Title: NAICS Code Prediction Using Supervised Methods
Author(s): Anne Parker and Evan Schulz and Christine Oehlert*
Companies: Internal Revenue Service and Internal Revenue Service and Internal Revenue Service
Keywords: NAICS codes; Random forest; CART; Tax compliance; Machine learning

When compiling industry statistics or selecting businesses for further study, researchers often rely on North American Industry Classification System (NAICS) codes. However, NAICS codes are self-reported on tax forms, and mistakes carry no tax consequences, so the codes are often unreliable. The IRS’s Statistics of Income (SOI) program validates NAICS codes for businesses in its samples, including sole proprietorships (those filing Form 1040 Schedule C) and corporations (those filing Form 1120). For sole proprietorships, we overcame several record linkage complications to combine data from SOI samples with other administrative data. Using the SOI-validated NAICS codes as ground truth, we trained classification-tree-based models (CART and random forest) to predict NAICS industry sector from other tax return data, including text descriptions, both for businesses that initially reported a valid NAICS code and for those that did not. For both sole proprietorships and corporations, we were able to improve slightly on the accuracy of valid self-reported industry sectors and to correctly identify the NAICS sector for over half of businesses with no informative reported NAICS code.
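
As a rough illustration of the modeling setup described in the abstract, the sketch below fits a CART-style decision tree and a random forest to predict a two-digit NAICS sector from a text description and one numeric field. It is not the authors' implementation: it assumes scikit-learn and pandas, the column names (description, gross_receipts, validated_sector) are hypothetical stand-ins for tax return fields, and the rows are made-up toy values rather than SOI data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Made-up toy rows for illustration only; these are NOT SOI data.
# Column names are hypothetical stand-ins for tax return fields.
returns = pd.DataFrame(
    {
        "description": [
            "lawn mowing and landscaping",
            "landscaping services",
            "residential home construction",
            "roofing contractor",
            "family dentist office",
            "home health care nurse",
        ],
        "gross_receipts": [40000, 55000, 250000, 180000, 300000, 60000],
        # Two-digit NAICS sectors: 56 administrative and support services,
        # 23 construction, 62 health care and social assistance.
        "validated_sector": ["56", "56", "23", "23", "62", "62"],
    }
)


def make_features():
    """TF-IDF features from the text description plus one numeric field."""
    return ColumnTransformer(
        [
            # A scalar column name hands the vectorizer a 1-D text column.
            ("text", TfidfVectorizer(), "description"),
            ("amount", "passthrough", ["gross_receipts"]),
        ]
    )


# DecisionTreeClassifier is scikit-learn's CART-style tree.
models = {
    "cart": make_pipeline(make_features(), DecisionTreeClassifier(random_state=0)),
    "random_forest": make_pipeline(
        make_features(), RandomForestClassifier(n_estimators=100, random_state=0)
    ),
}

X = returns[["description", "gross_receipts"]]
y = returns["validated_sector"]  # validated codes play the role of ground truth
for model in models.values():
    model.fit(X, y)

# Score a hypothetical return that reported no usable NAICS code.
new_return = pd.DataFrame(
    {"description": ["landscaping and lawn care"], "gross_receipts": [48000]}
)
predictions = {name: model.predict(new_return)[0] for name, model in models.items()}
print(predictions)
```

In any real application, accuracy would be measured on held-out validated returns and compared against the accuracy of the self-reported sector, which is the comparison the abstract reports.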

Authors who are presenting talks have a * after their name.

Back to the full JSM 2020 program