Online Program

All Times ET

Wednesday, June 2
Practice and Applications
Data Science Shaping the Financial World
Wed, Jun 2, 1:10 PM - 2:45 PM
TBD
 

BERT as a Filter to Detect Pharmaceutical Innovations in News Articles (309685)

Neil Alexander Kattampallil, The University of Virginia 
Gary Anderson, National Science Foundation 
*Martha Czernuszenko, The University of Virginia  
Gizem Korkmaz, The University of Virginia 
Devika Mahoney-Nair, The University of Virginia 

Keywords: BERT, natural language processing, innovation, information retrieval, masked language modeling, text-based data

Innovation has traditionally been measured through surveys of selected companies, such as the Business R&D and Innovation Survey and comparable European surveys such as the Community Innovation Survey. Organizations have stressed the importance of exploring non-traditional methods that measure innovation outside of these surveys. Our goal is to develop Natural Language Processing (NLP) and Machine Learning (ML) methods that measure business innovation from news articles, enriching and complementing the innovation measures obtained through surveys.

We present a novel approach that uses Bidirectional Encoder Representations from Transformers (BERT) to detect innovation in the headlines of news articles. We focus on the pharmaceutical sector because of its high rate of innovation. We develop a BERT-based filtering technique and use it to build an information retrieval system that detects mentions of innovation in pharmaceutical news headlines.

To identify headlines describing innovation, we leverage BERT to generate two sets of tokens: first, a set of broadly applicable innovation tokens, and second, a set of BERT-predicted words for each word in a news headline. We search the second, headline-generated set for tokens from the first set and treat any headline with a matching token as a potential innovation headline. We then manually assemble a set of regulatory and financial tokens and treat any occurrence of one as grounds to exclude a headline from the potential innovation set. We evaluate our results with standard classification metrics and show that our method achieves 97% accuracy while correctly identifying almost half of the true innovation headlines, that is, a recall of nearly 50%.
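
The filtering steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the token sets `INNOVATION_TOKENS` and `EXCLUSION_TOKENS` are hypothetical (the abstract does not list the actual tokens, and the innovation set here is written by hand rather than generated by BERT), the masked language model is abstracted as a caller-supplied function `predict_fn`, and `toy_predict` is a canned stand-in for a real model. The sketch also assumes that exclusion is triggered by a regulatory or financial token appearing in the headline's own words.

```python
from typing import Callable, List, Set

MASK = "[MASK]"

# Hypothetical token sets, for illustration only.
INNOVATION_TOKENS: Set[str] = {"launches", "introduces", "unveils"}
EXCLUSION_TOKENS: Set[str] = {"fda", "approval", "shares", "earnings"}

# A predictor maps a headline containing one [MASK] to candidate fill-in words.
Predictor = Callable[[str], List[str]]


def headline_candidates(headline: str, predict_fn: Predictor) -> Set[str]:
    """Mask each word in turn and collect the predicted replacement words."""
    words = headline.split()
    candidates: Set[str] = set()
    for i in range(len(words)):
        masked = " ".join(words[:i] + [MASK] + words[i + 1:])
        candidates.update(w.lower() for w in predict_fn(masked))
    return candidates


def is_potential_innovation(
    headline: str,
    predict_fn: Predictor,
    innovation_tokens: Set[str] = INNOVATION_TOKENS,
    exclusion_tokens: Set[str] = EXCLUSION_TOKENS,
) -> bool:
    """Keep a headline if a predicted word is an innovation token, unless the
    headline itself contains a regulatory or financial token (assumption)."""
    surface = {w.strip(".,:;!?").lower() for w in headline.split()}
    if surface & exclusion_tokens:
        return False
    return bool(headline_candidates(headline, predict_fn) & innovation_tokens)


def toy_predict(masked_headline: str) -> List[str]:
    """Canned stand-in for a masked language model (hypothetical outputs)."""
    table = {
        "Acme Pharma [MASK] new insulin pen": ["launches", "sells", "makes"],
    }
    return table.get(masked_headline, [])


if __name__ == "__main__":
    for h in [
        "Acme Pharma debuts new insulin pen",
        "Acme Pharma shares fall after earnings",
        "Acme Pharma opens office in Boston",
    ]:
        print(h, "->", is_potential_innovation(h, toy_predict))
```

In this toy run, the first headline is kept even though it contains no innovation token itself, because the prediction for the masked verb includes one. In practice, `predict_fn` could wrap any masked language model, for example a fill-mask pipeline from the Hugging Face `transformers` library; that choice is an assumption here and is not stated in the abstract.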