Friday, February 24
PS2 Poster Session 2 and Refreshments Fri, Feb 24, 5:15 PM - 6:30 PM
Conference Center AB

Performance of Data Mining Methods in an Example with Ordinal and Imbalanced Data (303437)

*Elena Rantou, FDA 
Paul Schuette, FDA 
Mingwei Tang, University of Washington 

Keywords: Clinical site inspections, ordinal class, imbalanced data, prediction sensitivity, accuracy

In a clinical trial setting, data mining approaches can be employed to evaluate data reliability, which may be jeopardized by poorly collected, processed, or reported data or, occasionally, by fraudulent data. Supervised learning methods, both from existing R packages and developed for this project, are considered for several scenarios, using clinical trial data and the results of clinical site inspections. Ordinal regression, combined binary classifiers, random forests, and boosted trees are used to predict three ordinal classes of inspection outcome. Where a data mining technique requires it, missing clinical trial data values are imputed. The Synthetic Minority Over-Sampling Technique (SMOTE), sampling-based methods, and cost-based methods are applied to deal with the imbalance in inspection outcomes and to improve prediction accuracy for infrequently occurring results. Cross-validation is used to determine the parameters that provide the best fit. Covariates predictive of inspection outcomes are identified from the variable importance given by the model. All models are compared using 5-fold cross-validation. The traditional ordinal regression model does not perform well on the data set, while random forests and boosted trees show a significant improvement in overall accuracy. Sensitivity for single-class prediction is improved by SMOTE and the sampling-based methods.
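
To make the oversampling step concrete, the following is a minimal Python sketch of the core SMOTE idea: each synthetic point is placed at a random position on the line segment between a minority-class sample and one of its nearest minority-class neighbours. It is an illustration only, not the authors' implementation (the abstract states that R packages were used). The function name `smote`, the toy data, and the choice to draw base samples at random rather than cycling through every minority sample are assumptions made for this example.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples
    and their k nearest minority-class neighbours (Euclidean distance)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_synthetic):
        # Simplification: pick the base sample at random.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by distance to the base sample.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: four minority-class records with two numeric covariates.
minority_sites = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority_sites, n_synthetic=6, k=2)
```

Because every synthetic point is a convex combination of two existing minority samples, the new points stay inside the region already occupied by the minority class. The sketch assumes numeric covariates; categorical covariates would need a different distance and interpolation rule.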