640 – A New Age of Data Mining in the High-Performance World
A Forest Measure of Variable Importance Resistant to Correlations
Padraic G. Neville
SAS Institute
Pei-Yi Tan
SAS Institute
Variable importance estimates that are output from decision trees and random forests are often used to reduce the dimension of data, especially in the presence of many variables, because decision trees can process many variables quickly. However, trees typically inflate the importance of correlated variables and even promote irrelevant correlated variables above predictive independent variables. Strobl et al. (2008) analyze the cause and propose a remedy. Unfortunately, the remedy is too complex to be practical for a large number of observations. This paper presents a simple method, called random branch assignments, which conforms to the analysis of Strobl et al. and yet can handle many observations. Although the method still incorrectly ranks the variables when the signal-to-noise ratio is less than 1, it is dramatically less sensitive to correlation effects than the measures of variable importance in the randomForest() function in R.