Friday, February 24
CS11 Text Analytics Fri, Feb 24, 2:00 PM - 3:30 PM
City Terrace 7

Predicting Regulatory Risk from Unstructured Text Data (303322)

*Danielle Leigh Boree, Johnson & Johnson Vision Care, Inc. 
Terri Henderson, Johnson & Johnson Vision Care, Inc. 
Jin Su, Johnson & Johnson Vision Care, Inc. 

Keywords: text analytics, data mining, predictive modeling, classification

Unstructured text data are collected across industries in many functional areas such as call centers, marketing research, medical history files, and clinical trials. These text fields may contain information that requires timely action to ensure patient safety or to satisfy regulatory requirements. Accurate review and classification of such data can prove particularly challenging when the volume is large and the cost of misclassification is high. This presentation provides a case study of how text mining and predictive modeling techniques can be combined and automated to review unstructured text files efficiently and to categorize them accurately according to risk.

The call center at a large medical device company receives several hundred calls per day from around the world. Call narratives are manually reviewed by the call center’s medical team and classified as either reportable or non-reportable to each country’s regulatory authority. However, because of the high volume of calls, reporting windows that differ by country, and the risk of human error in the categorization process, narratives need to be prioritized for review according to their reporting risk. Natural language processing (NLP) tools were used to parse approximately 600,000 narratives from the past 5 years and determine whether they contained any key phrases that could indicate a potential device-related safety issue. The text parsing output then served as the input features for a neural network data mining algorithm that assigned each narrative a probability of containing a serious adverse event. To ensure classification within all regulatory reporting windows, text parsing and probability assignment were automated using SAS® code and scheduled to execute daily. Narratives with a predicted risk probability above a certain threshold were automatically prioritized for review by the medical team for reporting to the appropriate regulatory agency. The algorithm’s misclassification rate, measured against the classification assigned by the medical expert, was below 0.5%.
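
The workflow described above (key-phrase parsing, probability assignment, threshold-based prioritization, and comparison with expert labels) can be illustrated with a minimal Python sketch. This is not the authors' SAS® implementation. The phrase list, the function names, and the threshold value are all hypothetical, and the trained neural network is replaced by a placeholder scorer (the fraction of key phrases found) purely so the example runs on its own.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical key phrases; the abstract does not list the ones actually used.
KEY_PHRASES = ["eye pain", "vision loss", "infection", "hospital"]


def phrase_features(narrative: str, phrases: List[str] = KEY_PHRASES) -> List[int]:
    """Return one 0/1 indicator per key phrase (case-insensitive, whole-word match)."""
    text = narrative.lower()
    return [
        1 if re.search(r"\b" + re.escape(p) + r"\b", text) else 0
        for p in phrases
    ]


def placeholder_score(features: List[int]) -> float:
    """Stand-in for the trained neural network: fraction of key phrases present.

    Assumption for illustration only; a real model would be fitted to
    expert-labeled narratives.
    """
    return sum(features) / len(features) if features else 0.0


def prioritize(
    narratives: Dict[str, str],
    score: Callable[[List[int]], float] = placeholder_score,
    threshold: float = 0.25,  # hypothetical cutoff
) -> List[Tuple[str, float]]:
    """Return (id, probability) pairs strictly above the threshold, highest risk first."""
    scored = [(nid, score(phrase_features(text))) for nid, text in narratives.items()]
    flagged = [(nid, p) for nid, p in scored if p > threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


def misclassification_rate(predicted: List[int], expert: List[int]) -> float:
    """Share of narratives where the predicted label differs from the expert label."""
    if len(predicted) != len(expert) or not expert:
        raise ValueError("label lists must be non-empty and of equal length")
    wrong = sum(1 for p, e in zip(predicted, expert) if p != e)
    return wrong / len(expert)
```

In a daily scheduled run, `prioritize` would be called on that day's new narratives, and the resulting queue would go to the medical team for review. The placeholder scorer would be swapped for the fitted model's probability output.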