Regency EF
Evaluation of Multivariate Classification Models for Analyzing NMR Metabolomics Data (304064)
*Thao T. Vu, University of Nebraska - Lincoln
Keywords: metabolomics, multivariate, classification models, NMR
Analytical techniques (e.g., NMR and MS) can generate large metabolomics data sets containing thousands of spectral features derived from numerous biological observations. Multivariate data analysis is routinely used to uncover the underlying biological information contained within these large data sets by classifying the observations into groups (e.g., control versus treated) and identifying the associated discriminating features. There are a variety of classification models to select from, such as partial least squares (PLS), orthogonal partial least squares (OPLS), and machine learning algorithms (e.g., support vector machines or random forests). However, it is unclear which classification model, if any, is an optimal choice. Herein, we present a comprehensive evaluation of five common classification models routinely employed in the metabolomics field, based on simulated and experimental NMR data sets with various levels of group separation. Model performance was assessed by prediction accuracy rate, area under the ROC curve, and the identification of true discriminating features. When the models were stressed with subtle group differences, OPLS emerged as the best-performing model.
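As an illustration of two of the performance measures named above, the following minimal sketch computes a prediction accuracy rate and an area under the ROC curve for a two-group problem (0 = control, 1 = treated). It is not the authors' evaluation code: the function names `accuracy` and `roc_auc`, the toy labels and scores, and the 0.5 decision threshold are all assumptions made for illustration, and the AUC is computed with the standard rank-based (Mann-Whitney) formulation.

```python
def accuracy(y_true, y_pred):
    """Fraction of observations whose predicted label matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """Area under the ROC curve via the Mann-Whitney formulation.

    Equals the probability that a randomly chosen treated observation (1)
    receives a higher score than a randomly chosen control (0); ties count 0.5.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both groups must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: true group labels and classifier scores.
y_true = [0, 0, 0, 1, 1, 1]
scores = [0.1, 0.4, 0.6, 0.5, 0.8, 0.9]

# Assumed decision threshold of 0.5 to turn scores into predicted labels.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print("accuracy:", accuracy(y_true, y_pred))
print("AUC:", roc_auc(y_true, scores))
```

Accuracy depends on the chosen threshold, whereas the AUC summarizes ranking quality over all thresholds, which is one reason the two measures are often reported together when group separation is subtle.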