Abstract:
|
Boilerplate language (i.e., generic, overly-used, and mainly inconsequential statements) is known to be pervasive in the filings that the Securities Exchange Commission collects, as firms and entities use this strategy to hedge against risk of insufficient disclosure. However, I argue that boilerplate language makes meaningful content harder to find and analyze. To help combat this phenomenon, I propose an algorithm to automatically find such language, based on Word2Vec representations, and the DBSCAN clustering algorithm. I'll conclude by highlighting some promising results on a mutual fund prospectus dataset, as well as presenting thoughts for use-cases and future work.
|