Abstract Details

Activity Number: 463 - Novel Uses of Text Analysis in Government Agencies
Type: Topic Contributed
Date/Time: Wednesday, August 1, 2018 : 8:30 AM to 10:20 AM
Sponsor: Government Statistics Section
Abstract #329741 Presentation
Title: Towards Automated Boilerplate Detection
Author(s): Marco Enriquez*
Companies: US Securities & Exchange Comm

Boilerplate language (i.e., generic, overused, and largely inconsequential statements) is known to be pervasive in the filings that the Securities and Exchange Commission collects, as firms and other entities use it to hedge against the risk of insufficient disclosure. However, I argue that boilerplate language makes meaningful content harder to find and analyze. To help combat this problem, I propose an algorithm that automatically identifies such language using Word2Vec representations and the DBSCAN clustering algorithm. I'll conclude by highlighting some promising results on a mutual fund prospectus dataset and by presenting thoughts on use cases and future work.
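
The abstract does not spell out the algorithm, so the following is only a rough sketch of the general idea: embed each sentence as a vector, run density-based clustering, and treat sentences that fall in dense clusters as boilerplate candidates. Several things here are assumptions rather than the author's method. To stay dependency-free, a bag-of-words count vector stands in for the Word2Vec representation (in practice one might average pretrained Word2Vec word vectors per sentence), DBSCAN is hand-written with cosine distance, and the example sentences and the `eps` and `min_samples` values are made up for illustration.

```python
import math
from collections import Counter


def sentence_vector(sentence):
    """Stand-in embedding: bag-of-words counts (an assumption, NOT Word2Vec)."""
    return Counter(sentence.lower().split())


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def dbscan(vectors, eps, min_samples):
    """Return one label per vector: a cluster id >= 0, or -1 for noise."""
    n = len(vectors)
    # Neighborhoods include the point itself, so min_samples counts it too.
    neighbors = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
    return labels


# Hypothetical sentences, loosely in the style of a fund prospectus.
sentences = [
    "past performance is not a guarantee of future results",
    "past performance is no guarantee of future results",
    "past performance does not guarantee future results",
    "the fund invests primarily in municipal bonds issued in ohio",
    "the portfolio manager has served since march 2011",
]

vectors = [sentence_vector(s) for s in sentences]
labels = dbscan(vectors, eps=0.4, min_samples=3)  # illustrative parameters

# Assumption: near-duplicate sentences forming a dense cluster are the
# boilerplate candidates; noise points are treated as specific content.
boilerplate = [s for s, label in zip(sentences, labels) if label != -1]
print(boilerplate)
```

In this toy input the three near-identical disclaimers are close enough to form one cluster, while the two fund-specific sentences are left as noise. On real filings the distance threshold and minimum cluster size would need tuning, and a cluster would only be a candidate for review, not a final verdict.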

Authors who are presenting talks have a * after their name.

Back to the full JSM 2018 program