Abstract:
|
Random forest (RF) has demonstrated the ability to select important variables and model complex data. However, because the RF algorithm randomly samples both data points and variables, the rankings of the selected variables can differ among models fitted to the same data set. As a result, a noise variable may be selected over a main variable. This research investigates the intersection and average methods for stabilizing RF's variable selection. First, multiple RF models are fitted to the data, and the ranking of variables and their relative importance are evaluated for each model. The average method ranks the variables by their mean relative importance across models. The intersection method iteratively selects the variables that are common to the top-ranked variables of these models. These methods also showed potential in detecting main effects in interaction terms.
|
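The two stabilization methods described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the relative importances from each fitted RF model are already available as a dictionary, and the function names `average_ranking` and `intersection_selection`, the `n_select` stopping rule, and the example numbers are all hypothetical. In particular, the abstract does not say how the intersection iteration proceeds; here it is assumed to widen the top-k window until enough variables are shared by every model.

```python
from statistics import mean


def average_ranking(importances):
    """Rank variables by their mean relative importance across models.

    importances: list of dicts (one per fitted model) mapping
    variable name -> relative importance.
    """
    names = importances[0].keys()
    avg = {v: mean(run[v] for run in importances) for v in names}
    return sorted(avg, key=avg.get, reverse=True)


def intersection_selection(importances, n_select):
    """Select variables shared by the top-k lists of all models.

    Assumed iteration rule (not specified in the abstract): start at
    k = 1 and increase k until at least n_select variables appear in
    the top-k of every model.
    """
    rankings = [sorted(run, key=run.get, reverse=True) for run in importances]
    n_vars = len(rankings[0])
    for k in range(1, n_vars + 1):
        common = set(rankings[0][:k])
        for ranking in rankings[1:]:
            common &= set(ranking[:k])
        if len(common) >= n_select:
            return sorted(common)
    # Reached only if n_select exceeds the number of variables.
    return sorted(rankings[0])


# Made-up importances from three hypothetical RF fits:
# x1 and x2 play the role of main variables, n1 and n2 of noise.
runs = [
    {"x1": 0.40, "x2": 0.30, "n1": 0.20, "n2": 0.10},
    {"x1": 0.40, "x2": 0.20, "n1": 0.25, "n2": 0.15},
    {"x1": 0.35, "x2": 0.40, "n1": 0.10, "n2": 0.15},
]

print(average_ranking(runs))
print(intersection_selection(runs, n_select=2))
```

In the second made-up fit the noise variable `n1` outranks `x2`, which is the kind of instability the abstract describes; both aggregation rules still place `x1` and `x2` ahead of the noise variables for this toy input.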