Abstract Details

Activity Number: 253 - Contributed Poster Presentations: Section on Statistical Computing
Type: Contributed
Date/Time: Monday, July 30, 2018 : 2:00 PM to 3:50 PM
Sponsor: Section on Statistical Computing
Abstract #330501
Title: Widespread (Unintentional) Corruption of Cross Validation Techniques for Prediction Models on Imputed Data Sets
Author(s): Milo Page* and Alyson Wilson and Chris Gotwalt
Companies: NC State University/JMP and North Carolina State University and JMP
Keywords: Missing Data; Imputed Data; Model Tuning; Streaming Data; Matrix Completion

Missing data are ubiquitous in applied settings and can arise for many reasons, including failing sensors and reporting errors. Imputation is often used as a pre-processing step to fill in missing values before fitting a prediction model such as a neural network or a regression. In practice, cross validation is often used to fit the prediction model using training and validation partitions of the data, but the same separation is often not maintained when fitting the imputation model. When the imputation model is fit without that separation, the validation set no longer provides an independent assessment of model fit, because data from the validation set are used to impute missing values in the training set. Multiple imputation corrects the prediction model's standard errors, but it does not address the corruption of the training and validation partitioning. In this talk, I'll discuss a method for resolving this problem by using imputation models designed for streaming data. I'll demonstrate it using Automated Data Imputation, an empirically tuned, streaming matrix completion method.
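
The leakage described in the abstract can be illustrated with a small sketch. This is not the Automated Data Imputation method named above; it assumes a plain column-mean imputer as a stand-in, and the helper names (fit_mean_imputer, apply_imputer) and the toy data are hypothetical. It contrasts fitting the imputer on all rows with fitting it on the training partition only and then applying it, unchanged, to the validation partition.

```python
import numpy as np


def fit_mean_imputer(X_train):
    """Learn per-column means from the given rows only, ignoring NaNs."""
    return np.nanmean(X_train, axis=0)


def apply_imputer(X, col_means):
    """Return a copy of X with NaNs replaced by the supplied column means."""
    X_filled = X.copy()
    rows, cols = np.where(np.isnan(X_filled))
    X_filled[rows, cols] = col_means[cols]
    return X_filled


# Toy data (illustrative only): 5 rows, 2 columns, NaN marks a missing value.
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [100.0, 40.0],
    [np.nan, 50.0],
])
train_idx = np.array([0, 1, 2])
valid_idx = np.array([3, 4])

# Leaky: the imputer sees validation rows, so training values are
# imputed using information from the validation partition.
leaky_means = fit_mean_imputer(X)
X_train_leaky = apply_imputer(X[train_idx], leaky_means)

# Separated: the imputer is fit on training rows only, then applied
# to both partitions without being refit.
train_means = fit_mean_imputer(X[train_idx])
X_train_clean = apply_imputer(X[train_idx], train_means)
X_valid_clean = apply_imputer(X[valid_idx], train_means)

print("leaky imputed value:    ", X_train_leaky[1, 0])
print("separated imputed value:", X_train_clean[1, 0])
```

In the leaky version, the missing training value in the first column is filled with a mean that is pulled upward by the validation row containing 100.0. In the separated version it depends on training rows only, which is the property a validation set needs in order to remain an independent check.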

Authors who are presenting talks have a * after their name.

Back to the full JSM 2018 program