Online Program

Friday, September 14
Fri, Sep 14, 8:00 AM - 9:15 AM
Lincoln 6
Statistical Learning and Artificial Intelligence in Drug Discovery and Development

Data Mining and Modeling Methods for Site Inspection Selection (300664)

*Elena Rantou, FDA/CDER 
Paul H Schuette, FDA/CDER 

Keywords: Data Mining, Clinical Site Inspection, Classifiers, Variable Sensitivity, Key-Risk Indicators

In a clinical trial setting, data mining approaches can be employed to evaluate data reliability, which may be jeopardized by poorly collected, processed, or reported data or, occasionally, by fraudulent data. We consider supervised learning methods that use clinical trial data together with the results of clinical site inspections. Ordinal regression, combined binary classifiers, random forests, and boosted trees are employed to predict three ordinal classes of inspection outcome. Where a specific data mining technique requires it, missing clinical trial data values are imputed. The Synthetic Minority Over-Sampling Technique (SMOTE), along with sampling-based and cost-based methods, has also been applied to deal with class imbalance in the inspection outcomes and to improve the prediction accuracy for infrequently occurring results. SMOTE methods can be shown to improve the sensitivity of a single-class prediction.
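To make the oversampling step concrete, the following is a minimal sketch of the core SMOTE idea: each synthetic point is placed on the line segment between a minority-class sample and one of its nearest minority-class neighbours. It is not the authors' implementation. The function name `smote`, its parameters, and the toy site features are all assumptions made for illustration.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical toy data: two site-level features for the rare inspection class.
minority_sites = [(0.10, 0.80), (0.20, 0.90), (0.15, 0.70), (0.30, 0.85)]
new_sites = smote(minority_sites, n_new=6, k=2, seed=42)
```

Because every synthetic point is a convex combination of two real minority samples, the new points stay inside the region the minority class already occupies, which is why the technique tends to raise sensitivity for the rare class without simply duplicating rows.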

When two of the three ordinal classes are merged into one, random forests, boosted trees, and deep neural networks are employed to predict the resulting binary outcome. Furthermore, we study how sensitive the predicted outcome is to each variable.
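The binary reduction and the variable sensitivity study can be sketched as follows. The abstract does not say which two classes are merged or how sensitivity is measured, so this sketch assumes ordinal codes 0, 1, 2 with the two lower classes merged, and it uses permutation-based accuracy drop as one common sensitivity measure. The threshold rule `predict` is a hypothetical stand-in for a fitted model, and the data are invented.

```python
import random


def collapse_to_binary(ordinal_labels, positive_from=2):
    """Map ordinal codes to 0/1: codes >= positive_from become the positive class."""
    return [1 if y >= positive_from else 0 for y in ordinal_labels]


def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_sensitivity(predict, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy when one column is shuffled; a larger drop
    suggests the model leans more heavily on that variable.
    Rows must be tuples so that columns can be swapped by slicing."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for col in range(len(rows[0])):
        total = 0.0
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            permuted = [r[:col] + (v,) + r[col + 1:] for r, v in zip(rows, shuffled)]
            total += baseline - accuracy(predict, permuted, labels)
        drops.append(total / n_repeats)
    return drops


# Hypothetical toy data: (risk_score, unrelated_feature) per site.
rows = [(0.1, 5.0), (0.2, 1.0), (0.9, 4.0), (0.8, 2.0), (0.3, 3.0), (0.7, 6.0)]
ordinal = [0, 1, 2, 2, 1, 2]
binary = collapse_to_binary(ordinal)


def predict(row):
    """Stand-in model: flags a site when the first feature exceeds 0.5."""
    return 1 if row[0] > 0.5 else 0


drops = permutation_sensitivity(predict, rows, binary)
```

In this toy setup the stand-in model ignores the second feature entirely, so shuffling it leaves accuracy unchanged, while shuffling the first feature degrades it.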

An R Shiny application has been developed that uses these supervised learning methods to predict potentially fraudulent cases from different clinical sites. The application also uses cross-validation to select the parameters that give the best fit, and identifies the covariates that are predictive of the outcomes for both the ordinal and binary models. The most recent work focuses on quasi-machine learning techniques, exploring Key Risk Indicators (KRI) using linear mixed models.
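Parameter selection by cross-validation, as mentioned above, can be illustrated with a short sketch. The application itself is written in R and its internals are not described here, so everything below is an assumption: a one-dimensional k-nearest-neighbour classifier stands in for the actual models, the tuned parameter is its neighbour count, and the data are invented.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def knn_predict(train_x, train_y, x, n_neighbours):
    """Majority vote among the n_neighbours closest 1-D training points."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    nearest = order[:n_neighbours]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def cv_accuracy(xs, ys, n_neighbours, k=5):
    """Accuracy over all held-out points across k folds."""
    correct = 0
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        correct += sum(
            knn_predict(train_x, train_y, xs[i], n_neighbours) == ys[i] for i in fold
        )
    return correct / len(xs)


def select_parameter(xs, ys, candidates, k=5):
    """Return the candidate with the highest cross-validated accuracy
    (the earliest candidate wins ties) together with all scores."""
    scores = {c: cv_accuracy(xs, ys, c, k) for c in candidates}
    best = max(candidates, key=lambda c: scores[c])
    return best, scores


# Hypothetical toy data, interleaved so each contiguous fold holds both classes.
xs = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.15, 0.85, 0.25, 0.75]
ys = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
best, scores = select_parameter(xs, ys, candidates=[1, 3], k=5)
```

Contiguous folds keep the sketch short; with real inspection data, shuffled or stratified folds would usually be preferable so that the rare outcome class appears in every fold.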