Activity Number: 385
Type: Contributed
Date/Time: Wednesday, August 9, 2006: 8:30 AM to 10:20 AM
Sponsor: Section on Bayesian Statistical Science
Abstract - #306391
Title: Avoiding Bias from Feature Selection in Classification and Regression Models
Author(s): Longhai Li*+, Jianguo Zhang, and Radford Neal
Companies: University of Toronto
Address: Department of Statistics, Toronto, ON, M5S 3G3, Canada
Keywords: feature selection bias; optimistic; mixture models; factor analysis; gene expression; naive Bayes models
Abstract:
For many classification and regression problems, a large number of features are available for possible use. Often, for computational or other reasons, only a small subset of these features is selected for use in a model, based on some simple measure such as correlation with the response variable. This procedure can introduce an optimistic bias: the selected features appear more strongly related to the response than they actually are. We show how this bias can be avoided when using a Bayesian model for the joint distribution of features and response. The crucial insight is that we must retain the knowledge that the discarded features were found to correlate only weakly with the response. We describe implementations for naive Bayes models of real and binary data, and for factor analysis and two-component mixture models. Experiments with artificial data confirm that this method avoids bias due to feature selection. We also apply these models to actual gene expression data.
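The optimistic bias the abstract describes is easy to see directly. The following Python sketch is illustrative only (it is not the authors' implementation, and the sample sizes and variable names are arbitrary): it ranks pure-noise features by their sample correlation with a pure-noise response, keeps the top few, and fits them by least squares, yielding an apparently good in-sample fit even though no real signal exists.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 50, 1000, 10  # samples, candidate features, features kept

# Pure-noise features and a response with no real relationship to them.
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Rank features by absolute sample correlation with the response,
# as in the simple selection measure the abstract mentions.
corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
top = np.argsort(corr)[-k:]

# Least-squares fit on only the k "best" features: the in-sample fit
# looks spuriously good because the features were chosen post hoc.
beta, *_ = np.linalg.lstsq(X[:, top], y, rcond=None)
r2 = 1 - np.sum((y - X[:, top] @ beta) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"In-sample R^2 on selected noise features: {r2:.2f}")
```

The remedy sketched in the abstract is to condition the Bayesian model on what the selection step revealed, namely that every discarded feature showed only weak correlation with the response, rather than modeling the selected features as if the others had never been examined.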