Abstract:
|
It is often of interest in a regression problem to measure the "importance" of each feature in predicting the response. Classical variable importance methods trade off flexibility against inference: either a method applies only to parametric models and permits inference, or it accommodates flexible estimation procedures but does not support inference and generally lacks well-understood asymptotic properties. We propose an extension of ANOVA that can be applied with general, complex machine-learning-based prediction methods to flexibly estimate the additional proportion of the total variability in the outcome explained by a single feature or group of features. Using the tools of targeted learning, we show that, under certain conditions, we obtain efficient estimates of variable importance with asymptotically valid confidence intervals, even when fitting flexible estimation procedures. We demonstrate the performance of this ANOVA extension in a study of median house prices in the Boston area and in a study of risk factors for cardiovascular disease in South Africa.
|