Abstract Details

Activity Number: 528 - Analysis of Big Data
Type: Contributed
Date/Time: Wednesday, August 1, 2018 : 10:30 AM to 12:20 PM
Sponsor: Section on Statistical Learning and Data Science
Abstract #330511
Title: Correct Model Selection in Big Data Analyses
Author(s): Katherine Thompson*
Companies: University of Kentucky
Keywords: linear regression; logistic regression; model building; subset selection
Abstract:

Although recent attention has focused on improving predictive models, less consideration has been given to the variability introduced into models through incorrect variable selection. Here, the difficulty of choosing a scientifically correct model is explored both theoretically and practically, and the performance of traditional model selection techniques is compared with that of more recent methods. The results show that the model with the highest $R^2$ (or adjusted $R^2$) or the lowest Akaike Information Criterion (AIC) is often not the scientifically correct model, suggesting that traditional model selection techniques may not be appropriate when data sets contain a large number of covariates. This work starts by deriving the probability of choosing the scientifically correct model as a function of the regression model parameters, and shows that traditional model selection criteria are outperformed by methods that produce multiple candidate models for researchers' consideration. These results are demonstrated both in simulation studies and through an analysis of a National Health and Nutrition Examination Survey (NHANES) data set.
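
The kind of comparison described in the abstract can be illustrated with a small simulation. The sketch below is not the author's derivation or simulation design. It is a minimal example under assumed settings: independent standard normal covariates, a linear model with Gaussian noise, exhaustive search over all subsets, and hypothetical values for the sample size, the number of covariates, the true subset, and the coefficient size. It estimates how often the model with the lowest AIC, or the highest adjusted $R^2$, is exactly the model that generated the data.

```python
import itertools
import numpy as np


def fit_rss(X, y, subset):
    """Residual sum of squares for OLS on an intercept plus the chosen columns."""
    n = len(y)
    design = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)


def select_models(X, y):
    """Return the subsets chosen by lowest AIC and by highest adjusted R^2."""
    n, p = X.shape
    tss = float(((y - y.mean()) ** 2).sum())
    best_aic, best_adj = None, None
    for k in range(p + 1):
        for subset in itertools.combinations(range(p), k):
            rss = fit_rss(X, y, subset)
            # Gaussian AIC up to an additive constant shared by all models.
            aic = n * np.log(rss / n) + 2 * (k + 1)
            adj_r2 = 1 - (rss / (n - k - 1)) / (tss / (n - 1))
            if best_aic is None or aic < best_aic[0]:
                best_aic = (aic, subset)
            if best_adj is None or adj_r2 > best_adj[0]:
                best_adj = (adj_r2, subset)
    return best_aic[1], best_adj[1]


def recovery_rates(n=100, p=6, true_subset=(0, 1), beta=0.5, n_reps=200, seed=0):
    """Fraction of simulated data sets in which each criterion picks the true subset.

    All defaults are illustrative assumptions, not values from the talk.
    """
    rng = np.random.default_rng(seed)
    hits_aic = hits_adj = 0
    for _ in range(n_reps):
        X = rng.standard_normal((n, p))
        y = beta * X[:, list(true_subset)].sum(axis=1) + rng.standard_normal(n)
        aic_pick, adj_pick = select_models(X, y)
        hits_aic += aic_pick == tuple(true_subset)
        hits_adj += adj_pick == tuple(true_subset)
    return hits_aic / n_reps, hits_adj / n_reps


if __name__ == "__main__":
    aic_rate, adj_rate = recovery_rates()
    print(f"true model chosen by AIC:          {aic_rate:.2f}")
    print(f"true model chosen by adjusted R^2: {adj_rate:.2f}")
```

Raising p while holding the true subset fixed adds more noise covariates to the search, which is one way to probe the abstract's claim about data sets with a large number of covariates. Exhaustive search grows as 2^p, so this sketch is only practical for small p.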


Authors who are presenting talks have a * after their name.

Back to the full JSM 2018 program