Abstract:
|
Feature selection is known to be a difficult problem in high dimensions; however, it becomes even more challenging when the features themselves are highly sparse (or "rare"), as is the case in many modern data types, including text data. We prove that in this setting ordinary least squares is inconsistent even when the dimension remains fixed, and furthermore that the lasso fails to achieve support recovery when the design matrix is highly sparse. Having laid out the challenge of variable selection in the presence of such "rare features," we propose a new method that overcomes it by exploiting side information about the relationships between features. We show that our approach succeeds in rare feature selection where other methods fail, and we demonstrate on a text dataset how it leads to more predictive models and more interpretable output.
|