
432 – New Approaches in Classification Methods

Big Data Methods for Scraping Government Tax Revenue from the Web

Sponsor: Section on Statistical Learning and Data Science
Keywords: Big Data, Web scraping, Classification, Text analytics, PDF documents, Government units

Brian Dumbacher

U.S. Census Bureau

Cavan Capps

U.S. Census Bureau

The Quarterly Summary of State and Local Government Tax Revenue (QTax) is a survey conducted by the U.S. Census Bureau to obtain data on tax revenue collections. Much of this data is publicly available online, and some respondents direct QTax analysts to their websites instead of responding via questionnaire. An automated process for scraping these data could reduce respondent burden and improve the timeliness of data products, but it is challenging to develop: there are thousands of government websites with little standardization, and most publications are in Portable Document Format (PDF), a file type not readily amenable to analysis. In this research, we focus on one part of the challenge and study how Big Data methods can be used to determine whether a previously unseen PDF contains content related to government tax revenue. Our methods use Python and natural language processing tools to extract, clean, and organize data from PDFs. We compile a corpus of PDFs for machine learning purposes and compare the performance of various classifiers. Finally, we discuss how these methods, in combination with a web crawler, can be used to automate the full process of scraping data.
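To make the classification step concrete, the following is a minimal sketch in Python and not the authors' implementation. The abstract does not name the classifiers that were compared, so a multinomial naive Bayes model with add-one smoothing is used here purely as an assumed stand-in. The sketch also assumes the text has already been extracted from each PDF. The `tokenize` function, the `NaiveBayes` class, the stopword list, and the four training snippets are all hypothetical and invented for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list; a real pipeline would likely use a fuller one.
STOPWORDS = {"the", "of", "and", "for", "to", "in", "a", "by", "on"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, drop stopwords and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) > 2 and t not in STOPWORDS]


class NaiveBayes:
    """Multinomial naive Bayes over bag-of-words counts (assumed classifier)."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        # Vocabulary is every token seen in any training document.
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text):
        n_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for c in self.classes:
            total = sum(self.word_counts[c].values())
            # Log prior plus add-one-smoothed log likelihood of each known token.
            score = math.log(self.doc_counts[c] / n_docs)
            for tok in tokenize(text):
                if tok in self.vocab:
                    count = self.word_counts[c][tok]
                    score += math.log((count + 1) / (total + vocab_size))
            scores[c] = score
        return max(scores, key=scores.get)


# Toy training corpus standing in for text extracted from PDFs (invented data).
train_texts = [
    "Quarterly report of sales tax and income tax revenue collections",
    "Property tax revenue collected by the county this quarter",
    "City council meeting minutes on park renovation and zoning",
    "Public library summer reading schedule and events",
]
train_labels = ["tax", "tax", "other", "other"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("Monthly statement of tax revenue collections"))
print(model.predict("Library events and park schedule"))
```

In a fuller pipeline, the training texts would come from a labeled corpus of PDFs, and the same tokenized features could be passed to several classifiers so that their performance can be compared on held-out documents.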