Abstract:
|
Assessing the relative contribution of subsets of features to predicting the response is often of interest in predictive modeling applications. The variable importance measure used is commonly determined by the prediction technique employed, which creates a tradeoff, because restrictive assumptions are often necessary for valid statistical inference on the true importance. Rather than considering importance as a function of a specific prediction algorithm, it is useful to consider variable importance as a function of the true data-generating mechanism. In this work, we study variable importance measures that may be used with any prediction technique and whose interpretation is agnostic to the technique used. In particular, we study differences of U-statistic-based risk functionals. We discuss how these measures may be flexibly estimated using machine learning techniques, show that a plug-in estimator of the importance is efficient, and describe a procedure for constructing a valid confidence interval. Through simulations, we show that our proposal has good operating characteristics, and we illustrate its use with data from a study of risk factors for cardiovascular disease in South Africa.
|