Activity Number: 385
Type: Contributed
Date/Time: Wednesday, August 9, 2006: 8:30 AM to 10:20 AM
Sponsor: Section on Bayesian Statistical Science
Abstract - #306391
Title: Avoiding Bias from Feature Selection in Classification and Regression Models
Author(s): Longhai Li*+, Jianguo Zhang, and Radford Neal
Companies: University of Toronto
Address: Department of Statistics, Toronto, ON, M5S 3G3, Canada
Keywords: feature selection bias; optimistic; mixture models; factor analysis; gene expression; naive Bayes models
Abstract:
For many classification and regression problems, a large number of features are available for possible use. Often, for computational or other reasons, only a small subset of these features is selected for use in a model, based on some simple measure such as correlation with the response variable. This procedure can introduce an optimistic bias: the selected features appear more strongly related to the response than they actually are. We show how this bias can be avoided when using a Bayesian model for the joint distribution of features and response. The crucial insight is that we must retain the knowledge that the discarded features were found to correlate only weakly with the response. We describe implementations for naive Bayes models of real and binary data, and for factor analysis and two-component mixture models. Experiments with artificial data confirm that this method avoids bias due to feature selection. We also apply these models to actual gene expression data.
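The optimistic bias the abstract describes is easy to see directly. The following Python sketch is illustrative only (it is not the authors' implementation, and the sample sizes and variable names are arbitrary): it ranks pure-noise features by their sample correlation with a pure-noise response, keeps the top few, and fits them by least squares, yielding an apparently good in-sample fit even though no real signal exists.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 50, 1000, 10  # samples, candidate features, features kept

# Pure-noise features and a response with no real relationship to them.
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Rank features by absolute sample correlation with the response,
# as in the simple selection measure the abstract mentions.
corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
top = np.argsort(corr)[-k:]

# Least-squares fit on only the k "best" features: the in-sample fit
# looks spuriously good because the features were chosen post hoc.
beta, *_ = np.linalg.lstsq(X[:, top], y, rcond=None)
r2 = 1 - np.sum((y - X[:, top] @ beta) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"In-sample R^2 on selected noise features: {r2:.2f}")
```

The remedy sketched in the abstract is to condition the Bayesian model on what the selection step revealed, namely that every discarded feature showed only weak correlation with the response, rather than modeling the selected features as if the others had never been examined.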