Thursday, February 23
PS1 Poster Session 1 and Opening Mixer Thu, Feb 23, 5:30 PM - 7:00 PM
Conference Center AB

Web Scraping Government Tax Revenue with Machine Learning (303400)

Cavan Paul Capps, U.S. Census Bureau 
*Brian Arthur Dumbacher, U.S. Census Bureau 

Keywords: Web scraping, Machine learning, Text analytics, Data wrangling, Model evaluation

The U.S. Census Bureau is researching the potential of leveraging Big Data to enhance its economic programs. One example is the Quarterly Summary of State and Local Government Tax Revenue, which collects data on tax revenue collections from state and local governments. Much of this data is publicly available and can often be found on government websites. Obtaining the data directly from these websites could reduce respondent burden and aid data review. Developing a tool that scrapes relevant data from the web is challenging because there are thousands of government websites but little standardization in their structure and publications. In addition, most publications are in Portable Document Format (PDF), a file type that is not easily analyzed. To address these problems, researchers are studying and applying Big Data methods for unstructured data, text analytics, and machine learning. The goal is to develop a web scraper with machine learning that crawls government websites and discovers PDFs, classifies each PDF according to whether it contains relevant data on tax revenue collections, and extracts the relevant data and stores it in a database. This poster describes the research completed to date and other applications.
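The discover, classify, and store steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which crawler, features, or model are used, so the multinomial Naive Bayes classifier here is only a stand-in, and the class names, the example.gov URLs, the toy training sentences, and the `extracted_text` dictionary are all hypothetical. Downloading pages and extracting text or figures from real PDFs is left out, since it would need a third-party PDF library.

```python
import math
import re
import sqlite3
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkFinder(HTMLParser):
    """Collects absolute URLs of <a> links that point at .pdf files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore any query string when checking the file extension.
        if href and href.lower().split("?")[0].endswith(".pdf"):
            self.pdf_urls.append(urljoin(self.base_url, href))


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Tiny multinomial Naive Bayes with add-one smoothing (stand-in model)."""

    def fit(self, texts, labels):
        self.counts = {label: Counter() for label in set(labels)}
        self.docs = Counter(labels)
        for text, label in zip(texts, labels):
            self.counts[label].update(tokenize(text))
        self.vocab = {w for c in self.counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.docs.values())
        scores = {}
        for label, counts in self.counts.items():
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.docs[label] / total_docs)
            for word in tokenize(text):
                if word in self.vocab:  # skip words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Step 1: discover PDF links in a (hypothetical) page that was already fetched.
html = ('<a href="/reports/fy2016_tax.pdf">Tax report</a> '
        '<a href="/about.html">About</a> '
        '<a href="minutes.PDF">Minutes</a>')
finder = PdfLinkFinder("https://example.gov/finance/")
finder.feed(html)

# Step 2: train the stand-in classifier on toy labeled text.
model = NaiveBayes().fit(
    ["sales tax revenue collections quarterly total",
     "property tax collections income tax revenue",
     "city council meeting minutes agenda",
     "park schedule events calendar"],
    ["relevant", "relevant", "irrelevant", "irrelevant"],
)

# Hypothetical text already extracted from each discovered PDF.
extracted_text = {
    "https://example.gov/reports/fy2016_tax.pdf":
        "state tax revenue collections by quarter",
    "https://example.gov/finance/minutes.PDF":
        "council meeting minutes and agenda",
}

# Step 3: store each PDF's predicted label in a database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pdfs (url TEXT PRIMARY KEY, label TEXT)")
for url in finder.pdf_urls:
    conn.execute("INSERT INTO pdfs VALUES (?, ?)",
                 (url, model.predict(extracted_text[url])))
rows = conn.execute("SELECT url FROM pdfs WHERE label = 'relevant'").fetchall()
```

With these toy inputs, `rows` holds only the tax report URL. A production system would presumably need far more training data, richer features, and a table-extraction step for the figures themselves.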