Abstract:
|
Missing data are ubiquitous in applied settings and can arise for many reasons, including failing sensors and reporting errors. Data imputation is often used as a pre-processing step to fill in missing values before fitting a prediction model such as a neural network or a regression. In practice, cross-validation is often used to fit the prediction model on separate training and validation partitions of the data, but the same separation is frequently not maintained when fitting the imputation model. When the imputation model is fit without this separation, the validation set no longer provides an independent assessment of model fit, because data from the validation set are used to impute missing values in the training set. Multiple imputation corrects the prediction model's standard errors, but it doesn't address the corruption of the training and validation partitioning. In this talk, I'll discuss a method for resolving this problem by using imputation models designed for streaming data. I'll demonstrate it using Automated Data Imputation, an empirically tuned, streaming matrix completion method.
|
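To make the leakage described in the abstract concrete, here is a minimal sketch in plain Python. It uses simple column-mean imputation as a stand-in; it is not Automated Data Imputation or any streaming matrix completion method, and the helper names `fit_column_means` and `impute`, as well as the toy data, are hypothetical and chosen only for illustration.

```python
from statistics import fmean


def fit_column_means(rows):
    """Learn one fill value per column from the observed (non-None) entries."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        # Fall back to 0.0 if a column has no observed values (arbitrary choice).
        means.append(fmean(observed) if observed else 0.0)
    return means


def impute(rows, means):
    """Replace each missing entry with the fill value learned for its column."""
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


# Toy partitions (hypothetical data); None marks a missing value.
train = [[1.0, None], [3.0, 4.0]]
valid = [[None, 100.0], [5.0, 200.0]]

# Leaky: the imputer is fit on all rows, so validation values
# determine what gets filled into the training set.
leaky_means = fit_column_means(train + valid)
leaky_train = impute(train, leaky_means)

# Partition-respecting: the imputer is fit on the training rows only,
# then applied unchanged to both partitions.
clean_means = fit_column_means(train)
clean_train = impute(train, clean_means)
clean_valid = impute(valid, clean_means)

print("leaky fill for train[0][1]:", leaky_train[0][1])
print("clean fill for train[0][1]:", clean_train[0][1])
```

In the leaky variant, the value filled into the training row is pulled toward the large validation entries, which is the kind of contamination the abstract describes. Fitting on the training partition alone keeps the validation rows out of the imputed training data.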