Abstract:
|
When compiling industry statistics or selecting businesses for further study, researchers often rely on North American Industry Classification System (NAICS) codes. However, NAICS codes are self-reported on tax forms, and mistakes have no tax consequences, so the codes are often unreliable. The IRS’s Statistics of Income (SOI) program validates NAICS codes for businesses in its samples, including sole proprietorships (those filing Form 1040 Schedule C) and corporations (those filing Form 1120). For sole proprietorships, we overcame several record linkage complications to combine data from SOI samples with other administrative data. Using the SOI-validated NAICS codes as ground truth, we trained classification-tree-based models (CART and random forest) to predict NAICS industry sector from other tax return data, including text descriptions, both for businesses that initially reported a valid NAICS code and for those that did not. For both sole proprietorships and corporations, we were able to improve slightly on the accuracy of valid self-reported industry sectors and to correctly identify the NAICS sector for over half of the businesses with no informative reported NAICS code.
|