Abstract Details

Activity Number: 432
Type: Contributed
Date/Time: Tuesday, August 2, 2016 : 2:00 PM to 3:50 PM
Sponsor: Section on Statistical Learning and Data Science
Abstract #320595
Title: Big Data Methods for Scraping Government Tax Revenue from the Web
Author(s): Brian Dumbacher* and Cavan Capps
Companies: U.S. Census Bureau and U.S. Census Bureau
Keywords: Big Data ; Web scraping ; Classification ; Text analytics ; PDF documents ; Government units

The Quarterly Summary of State and Local Government Tax Revenue (QTax) is conducted by the U.S. Census Bureau to obtain data on tax revenue collections. Much of this data is publicly available online. Instead of responding via questionnaire, some respondents direct QTax analysts to their websites. An automated process for scraping data could reduce respondent burden and increase the timeliness of data products but is challenging to develop. There are thousands of government websites with little standardization, and most publications are in Portable Document Format (PDF), a file type not readily amenable to analysis. In this research, we focus on one part of the challenge and study how Big Data methods can be used to determine whether a previously unseen PDF contains content related to government tax revenue. Our methods use Python and natural language processing tools to extract, clean, and organize data from PDFs. A corpus of PDFs is compiled for machine learning purposes, and the performances of various classifiers are compared. Lastly, we discuss how these methods, in combination with a web crawler, can be used to automate the full process of scraping data.
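
The classification step described in the abstract, deciding whether the text of a previously unseen PDF relates to government tax revenue, can be illustrated with a minimal sketch. The abstract does not name the classifiers or libraries the authors compared, so everything below is an assumption made for illustration: a small multinomial Naive Bayes classifier written with the Python standard library, the hypothetical names `tokenize` and `NaiveBayesTextClassifier`, the labels "tax" and "other", and a handful of invented training strings standing in for text already extracted from PDFs.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTextClassifier:
    """Toy multinomial Naive Bayes with add-alpha smoothing (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha            # smoothing constant (assumed value)
        self.word_counts = {}         # label -> Counter of token frequencies
        self.doc_counts = Counter()   # label -> number of training documents
        self.vocab = set()            # every token seen during training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.doc_counts[label] += 1
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        # Tokens never seen in training carry no evidence and are skipped.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            for t in tokens:
                score += math.log((counts[t] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented stand-ins for text extracted from PDFs; not data from the study.
train_texts = [
    "Quarterly tax revenue collections: sales tax, income tax, property tax receipts",
    "Statement of general fund revenues including motor fuel tax and tobacco tax",
    "Minutes of the city council meeting on park renovation and library hours",
    "Public notice of road closure and snow removal schedule",
]
train_labels = ["tax", "tax", "other", "other"]

clf = NaiveBayesTextClassifier().fit(train_texts, train_labels)
prediction = clf.predict("Monthly report of sales tax and income tax collections")
```

In practice the training corpus would come from text extracted and cleaned from real PDFs, and several classifiers would be compared on held-out documents, as the abstract describes. The sketch only shows the shape of the "text in, label out" step.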

Authors who are presenting talks have a * after their name.

Back to the full JSM 2016 program

Copyright © American Statistical Association