Abstract:
|
Advances in Machine Learning (ML) have led to the development of highly accurate models for Quantitative Structure-Activity Relationship (QSAR), which predict the biological activity of a molecule from its molecular descriptors. QSAR applications often require quantitative estimates of prediction uncertainty (PU), such as prediction intervals (PI), alongside the predictions. Because it provides both predictions and PU estimates, we examined Bayesian Additive Regression Trees (BART) as a model for QSAR. In terms of prediction accuracy, BART underperformed compared to ML algorithms such as Deep Neural Network (DNN) and Light Gradient Boosting Machine (LGBM). However, estimating PU for these ML algorithms can be quite challenging due to parameter tuning or methodological constraints. Moreover, the conditional coverage probabilities of the existing PU estimation methods have not been studied sufficiently. In this work, we use BART to propose a novel method for PU estimation that is agnostic to the activity prediction algorithm (e.g., DNN or LGBM) and that, evaluated on 30 diverse QSAR datasets, provides favorable conditional PI estimates compared to alternative methods.
|