Like companies in most industries, pharmaceutical companies capture the vast majority of their experimental or generated data in structured relational databases. These provide rapid access to tabular data suitable for statistical or machine learning analyses, even at scale. However, most non-experimental data come from external sources in unstructured or semi-structured formats such as journal articles, conference proceedings, news, emails, and reports. While most scientific decisions are based on the structured experimental data, many of our strategic decisions rely on the unstructured data. Because the volume of this information is ever-increasing and we cannot effectively consume it, we require analytical techniques that transform these data into actionable insights.
This paper focuses on a case study that involves text classification as part of an extended text search. The challenge is to identify and classify sentences that relate to an extended semantic concept defined by a list of words and phrases. The particular problem involves providing oversight for the sales force in their online training materials, but the solution is applicable to a diverse set of problems and industries.
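To make the task concrete, the following is a minimal sketch of the simplest possible baseline: flagging sentences that literally contain one of the concept's words or phrases. It is an illustration only, not the method developed in this paper. The function names (`split_sentences`, `build_concept_pattern`, `flag_sentences`), the example term list, and the example text are all hypothetical, and the sentence splitter is deliberately naive.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_concept_pattern(terms):
    """Compile one case-insensitive pattern matching any concept term."""
    # Longest terms first, so multi-word phrases are tried before sub-words.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


def flag_sentences(text, terms):
    """Return (sentence, matched_terms) pairs; matched_terms is empty if none hit."""
    pattern = build_concept_pattern(terms)
    results = []
    for sentence in split_sentences(text):
        hits = sorted({m.group(0).lower() for m in pattern.finditer(sentence)})
        results.append((sentence, hits))
    return results


# Hypothetical concept definition and text, for illustration only.
concept_terms = ["side effect", "adverse event", "dosage"]
sample_text = (
    "The new Dosage was approved. Sales rose in Q3. "
    "No adverse event was reported!"
)

for sentence, hits in flag_sentences(sample_text, concept_terms):
    label = "related" if hits else "unrelated"
    print(f"{label:9s} {hits} {sentence}")
```

Exact matching of this kind misses paraphrases and inflected forms, and it flags incidental mentions, which is why the problem is framed here as classification over an extended semantic concept and not as plain keyword search.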