JSM 2012 | eventScribe Itinerary Planner


580 – Disease Prediction

Statistical Strategies for Developing Classification Algorithms with Application to Insulin Sensitivity Status

Sponsor: Section on Statistics in Epidemiology

Keywords: Boosted Regression Tree, Random Forest, Tree-based methods, Logistic regression, Insulin Sensitivity Status, metabolic markers

William D. Johnson

Pennington Biomedical Research Center

Bin Li

Louisiana State University

Eric Ravussin

Pennington Biomedical Research Center

Charmaine S. Tam

Pennington Biomedical Research Center

Wenting Xie

Insulin resistance is a strong precursor to the development of metabolic syndrome and type 2 diabetes. The hyperinsulinemic-euglycemic clamp, the gold standard for assessing insulin resistance in humans, is labor-intensive and expensive, so surrogate markers for insulin resistance are needed. In this paper, we applied newer statistical algorithms to improve the accuracy of insulin resistance prediction. Data on subject characteristics (age, ethnicity, sex), body composition (BMI), and blood biochemistry (glucose, insulin) were obtained from 270 individuals participating in research studies at the Pennington Biomedical Research Center in Louisiana between 2001 and 2011. Using these data, we compared four statistical methods for predicting insulin resistance: classical logistic regression and the newer methods of single classification tree, boosted regression tree (BRT), and random forest (RF). We also evaluated a novel approach that combines logistic regression with feature selection from BRT or RF. Random forest (AUC=0.858) and boosted regression tree (AUC=0.845) gave the best prediction performance, followed by logistic regression combined with feature selection from BRT or RF (AUC=0.763) and finally the single classification tree (AUC=0.741). However, when using only variables without a large proportion of missing values, we found that logistic regression (AUC=0.84) gave the best prediction performance. These results suggest that boosted regression tree and random forest approaches may provide better algorithms where missing data are an issue. We also found that an appropriate combination of traditional logistic regression and variable selection from BRT or RF may improve model performance. Logistic regression remains appropriate when missing data are not a factor.
In conclusion, we have illustrated the value of exploring different statistical models when assessing prediction performance in biomedical studies.
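
The abstract ranks the methods by AUC (area under the ROC curve). As a minimal sketch of what that figure measures, and not the authors' code, the function below computes AUC as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The function name `auc` and the toy labels and scores are hypothetical and are not taken from the study data.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0/1 class labels (1 = positive, e.g. insulin resistant)
    scores: iterable of predicted scores, higher meaning "more likely positive"
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive case ranked above negative case
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical toy example: three positives and three negatives.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]
print(round(auc(toy_labels, toy_scores), 3))
```

An AUC of 0.5 corresponds to ranking cases no better than chance and 1.0 to perfect separation, which is the scale on which the reported values (0.741 to 0.858) should be read. The pairwise loop is quadratic in the number of cases; that is acceptable for a sample of a few hundred subjects, while a rank-based formula would be preferable for larger data.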


