St. James Ballroom
A Comparison of Random Forest Variable Selection Methods for Classification Modeling (303814)
Eddie Ip, Wake Forest University School of Medicine
Mike Miller, Wake Forest University School of Medicine
*Jaime Lynn Speiser, Wake Forest University School of Medicine
Janet Tooze, Wake Forest University School of Medicine
Keywords: random forest, variable selection, prediction modeling
Random forest classification is a popular machine learning method for developing prediction models. A common goal in prediction modeling is to reduce the number of variables needed to obtain a prediction, which lowers the burden of data collection and improves efficiency. Several variable selection methods exist for random forest classification; however, there is a paucity of literature to guide users as to which method may be preferable for different types of datasets. Using 311 classification datasets freely available online, we evaluate the prediction error rates, number of variables selected, and computation times of these variable selection methods. A significant contribution of our study is a direct comparison of variable selection techniques in the setting of random forest classification.
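The three quantities named in the abstract (prediction error rate, number of variables, computation time) can be illustrated for a single selection method with a minimal sketch. Everything specific here is an assumption rather than the authors' procedure: the data are synthetic, scikit-learn's `SelectFromModel` with a mean-importance threshold stands in for a variable selection method (the abstract does not name the methods compared), and `evaluate_selection` is a hypothetical helper.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split


def evaluate_selection(X, y, seed=0):
    """Hypothetical helper: return (error_rate, n_selected, seconds)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y
    )

    # Time only the variable selection step.
    start = time.perf_counter()
    selector = SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=seed),
        threshold="mean",  # assumed rule: keep variables at or above mean importance
    ).fit(X_train, y_train)
    seconds = time.perf_counter() - start

    mask = selector.get_support()  # boolean mask of retained variables

    # Refit a forest on the retained variables and score it on held-out data.
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_train[:, mask], y_train)
    error_rate = 1.0 - model.score(X_test[:, mask], y_test)

    return error_rate, int(mask.sum()), seconds


# Synthetic stand-in for one classification dataset.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)
error_rate, n_selected, seconds = evaluate_selection(X, y)
print(f"error rate: {error_rate:.3f}, variables kept: {n_selected}, "
      f"selection time: {seconds:.2f}s")
```

Repeating such a measurement over many datasets and several selection methods would give the kind of comparison the abstract describes; a single train/test split is used here only to keep the sketch short.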