Saturday, February 17
PS3 Poster Session 3 and Continental Breakfast
Sat, Feb 17, 8:00 AM - 9:15 AM
Thematic Feature Selection for Research Support (303677)
Keywords: big data, microdata, R, machine learning, feature selection, variable selection, social science, public-use data
Social scientists often use large datasets, including the Current Population Survey (CPS) and the Survey of Income and Program Participation (SIPP), to study topics such as labor markets, demographics, and macroeconomics. Selecting appropriate variables (features) for inclusion in statistical models is a critical part of social science research, but it is painstaking and time-consuming. Typical approaches to feature selection are either (1) parsimonious a priori selection of a minimal number of features based on subject matter expertise, or (2) inclusion of all or nearly all variables with little attention paid to selection. Both approaches have shortcomings, and research that relies on datasets such as the CPS and the SIPP is vulnerable to them. This paper therefore looks to data science techniques to create a structured application for feature selection on large datasets. Using R, these feature selection methods will be evaluated on their ability to select important and relevant features and on their impact on model performance.