Abstract:
|
Feature selection is known to be a difficult problem in high dimensions; however, it becomes even more challenging when the features themselves are highly sparse (or "rare"), as is the case in many modern data types, including text data. We prove that in this setting ordinary least squares is inconsistent even when the dimension remains fixed, and furthermore that the lasso fails to achieve support recovery when the design matrix is highly sparse. Having laid out the challenge of variable selection in the presence of such "rare features," we propose a new method that overcomes it by exploiting side information about the relationships between features. We show that our approach succeeds in rare feature selection where other methods fail, and we demonstrate on a text dataset how it leads to more predictive models and more interpretable output.
|