580 – Disease Prediction
Statistical Strategies for Developing Classification Algorithms with Application to Insulin Sensitivity Status
William D. Johnson
Pennington Biomedical Research Center
Bin Li
Louisiana State University
Eric Ravussin
Pennington Biomedical Research Center
Charmaine S. Tam
Pennington Biomedical Research Center
Wenting Xie
Insulin resistance is a strong precursor to the development of the metabolic syndrome and type 2 diabetes. The hyperinsulinemic-euglycemic clamp, the gold standard for assessing insulin resistance in humans, is labor-intensive and expensive, so surrogate markers for insulin resistance are needed. In this paper, we applied newer statistical algorithms to improve the accuracy of insulin resistance prediction. Data including subject characteristics (age, ethnicity, sex), body composition (BMI) and blood biochemistry (glucose, insulin) were obtained from 270 individuals participating in research studies at the Pennington Biomedical Research Center in Louisiana between 2001 and 2011. Using these data, we applied and compared four statistical methods for predicting insulin resistance: classical logistic regression and the newer methods of single classification tree, boosted regression tree (BRT) and random forest (RF). We also evaluated a novel approach that combines logistic regression with feature selection from BRT or RF. Random forest (AUC=0.858) and boosted regression tree (AUC=0.845) gave the best prediction performance, followed by logistic regression combined with feature selection from BRT or RF (AUC=0.763) and finally the single classification tree (AUC=0.741). However, when using only variables without a large proportion of missing values, we found that logistic regression (AUC=0.84) gave the best prediction performance. These results suggest that boosted regression tree and random forest approaches may provide better algorithms where missing data are an issue. We also found that an appropriate combination of traditional logistic regression and variable selection from BRT or RF may improve model performance. Logistic regression remains appropriate when missing data are not a factor.
In conclusion, we have illustrated how different statistical models can be explored and compared when assessing prediction performance in biomedical studies.
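The comparisons above rest on the area under the ROC curve (AUC). As a minimal sketch of how that metric can be computed, the following uses the standard rank-based (Mann-Whitney) formulation. The function name `auc` and the toy labels and scores are illustrative assumptions, not data or code from the study.

```python
def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney U statistic.

    labels: 1 for the positive class (e.g. insulin resistant), 0 otherwise.
    scores: predicted risk, higher meaning more likely positive.
    Tied scores receive the average of the ranks they span.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the end of the current run of tied scores.
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank for the run
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")

    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


# Toy example (made-up values): two hypothetical models scored on the
# same six subjects.
labels = [1, 1, 0, 1, 0, 0]
scores_model_a = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
scores_model_b = [0.9, 0.4, 0.7, 0.6, 0.5, 0.2]

print(round(auc(labels, scores_model_a), 3))
print(round(auc(labels, scores_model_b), 3))
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is the scale on which the reported values (for example 0.858 for RF) should be read.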