Name: 2018 Joint Statistical Meetings
Start: 2018-07-28T07:00:00+00:00
End: 2018-08-02
Location: Vancouver Convention Centre

Activity Number:	463 - Novel Uses of Text Analysis in Government Agencies
Type:	Topic Contributed
Date/Time:	Wednesday, August 1, 2018 : 8:30 AM to 10:20 AM
Sponsor:	Government Statistics Section
Abstract #329741	Presentation
Title:	Towards Automated Boilerplate Detection
Author(s):	Marco Enriquez*
Companies:	US Securities & Exchange Comm
Keywords:
Abstract:	Boilerplate language (i.e., generic, overused, and mainly inconsequential statements) is known to be pervasive in the filings that the Securities and Exchange Commission collects, as firms and entities use it to hedge against the risk of insufficient disclosure. However, I argue that boilerplate language makes meaningful content harder to find and analyze. To help combat this phenomenon, I propose an algorithm to automatically find such language, based on Word2Vec representations and the DBSCAN clustering algorithm. I'll conclude by highlighting some promising results on a mutual fund prospectus dataset and by presenting thoughts on use cases and future work.
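
The abstract names Word2Vec representations and DBSCAN but gives no implementation details, so the following is only a minimal sketch of how such a pipeline might look, not the author's method. Everything specific in it is an assumption made for illustration: the tiny hand-written `TOY_VECTORS` table stands in for trained Word2Vec embeddings, sentences are represented by averaging their word vectors, similarity is measured by cosine distance, and the `eps` and `min_pts` values are arbitrary. Sentences that fall into a dense cluster are treated as boilerplate candidates; isolated sentences are labeled as noise.

```python
import math

# ASSUMPTION: hypothetical toy vectors standing in for trained Word2Vec
# embeddings. Real embeddings would be learned from the filing corpus.
TOY_VECTORS = {
    "past": [1.0, 0.0, 0.0], "performance": [0.9, 0.1, 0.0],
    "guarantee": [0.8, 0.2, 0.0], "future": [0.9, 0.0, 0.1],
    "results": [1.0, 0.1, 0.0],
    "fund": [0.0, 1.0, 0.0], "invests": [0.0, 0.9, 0.1],
    "biotech": [0.0, 1.0, 0.2],
    "fees": [0.0, 0.0, 1.0], "rose": [0.1, 0.0, 0.9],
    "sharply": [0.0, 0.1, 1.0],
}

def sentence_vector(sentence, vectors):
    """Average the vectors of known words (ASSUMPTION: mean pooling)."""
    dim = len(next(iter(vectors.values())))
    words = [w for w in sentence.lower().split() if w in vectors]
    if not words:
        return [0.0] * dim
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]

def cosine_distance(a, b):
    """Return 1 - cosine similarity; zero vectors are maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def dbscan(points, eps, min_pts):
    """Plain DBSCAN: returns one label per point, -1 meaning noise."""
    noise = -1
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j in range(len(points))
                if cosine_distance(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = noise  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:  # core point: keep expanding
                queue.extend(j_nbrs)
        cluster += 1
    return labels

# Hypothetical sentences, invented for this sketch.
sentences = [
    "past performance does not guarantee future results",
    "past performance is no guarantee of future results",
    "past results do not guarantee future performance",
    "the fund invests in biotech",
    "fees rose sharply",
]
points = [sentence_vector(s, TOY_VECTORS) for s in sentences]
labels = dbscan(points, eps=0.05, min_pts=3)  # ASSUMPTION: arbitrary settings
boilerplate = [s for s, lab in zip(sentences, labels) if lab != -1]
```

With these toy inputs, the three near-duplicate disclaimer sentences share one cluster label and end up in `boilerplate`, while the two distinct sentences are labeled noise. In practice one would presumably train embeddings with a library such as gensim and cluster with an existing DBSCAN implementation rather than the hand-rolled one shown here.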

Authors who are presenting talks have a * after their name.
