186 – Advances in Missing Data Methods
Missing Value Imputation for Predictive Models on Large and Distributed Data Sources
Jane Chu
IBM
Sier Han
IBM
Jing Shyr
IBM
This paper proposes a Map-Reduce method for imputing missing predictor values on large and distributed data sources ahead of predictive modeling. First, for each predictor with missing values, imputation models based only on the target variable are built independently on each data source, on separate machines, using Map functions. During this step, validation samples are drawn at random from all data sources; a Reduce function then merges them into one global validation sample and gathers the collection of imputation models. Second, another set of Map functions evaluates all imputation models against the global validation sample in a distributed manner, and the top K models are selected to form an ensemble model. Third, the ensemble model is sent to each data source to impute the missing predictor values. Finally, the completed dataset can be used to build any model for prediction, as well as for discovering and interpreting relationships between the target and a set of predictors. The type of imputation model depends on whether the predictor and the target are categorical or continuous. Because only the target variable is used, only basic statistics relating the predictor and the target (means, variances, covariances, counts, etc.) need to be collected, and this takes a single data pass, which is important for large and distributed data sources.
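The workflow described above can be illustrated with a minimal single-machine sketch for one case only: a continuous predictor and a continuous target. It is not the authors' implementation. The linear model form, the mean-squared-error validation criterion, the averaging of the top K models, and every name in the code (`LinearImputer`, `map_fit`, `select_top_k`, `ensemble_impute`) are assumptions made for illustration; the abstract does not specify them. Partitions are plain Python lists standing in for distributed data sources.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# One row is (predictor x, target y); x is None when it is missing.
Row = Tuple[Optional[float], float]


@dataclass
class LinearImputer:
    """Hypothetical imputation model: predicts the predictor from the target."""
    intercept: float
    slope: float

    def predict(self, y: float) -> float:
        return self.intercept + self.slope * y


def map_fit(partition: List[Row]) -> LinearImputer:
    """Map step (assumed form): fit x ~ y from statistics gathered in one pass."""
    n = sx = sy = sxy = syy = 0.0
    for x, y in partition:
        if x is None:
            continue  # only complete rows contribute to the statistics
        n += 1
        sx += x
        sy += y
        sxy += x * y
        syy += y * y
    if n == 0:
        raise ValueError("partition has no complete rows")
    ss_y = syy - sy * sy / n    # sum of squared deviations of the target
    ss_xy = sxy - sx * sy / n   # sum of cross-deviations
    slope = ss_xy / ss_y if ss_y > 0 else 0.0
    return LinearImputer(intercept=sx / n - slope * sy / n, slope=slope)


def validation_mse(model: LinearImputer, validation: List[Row]) -> float:
    """Score a model on complete rows of the global validation sample."""
    errors = [(model.predict(y) - x) ** 2 for x, y in validation if x is not None]
    return sum(errors) / len(errors)


def select_top_k(models: List[LinearImputer], validation: List[Row],
                 k: int) -> List[LinearImputer]:
    """Keep the k models with the lowest validation error (assumed criterion)."""
    return sorted(models, key=lambda m: validation_mse(m, validation))[:k]


def ensemble_impute(models: List[LinearImputer], partition: List[Row]) -> List[Row]:
    """Fill each missing x with the average prediction of the ensemble."""
    filled = []
    for x, y in partition:
        if x is None:
            x = sum(m.predict(y) for m in models) / len(models)
        filled.append((x, y))
    return filled
```

In a real deployment, `map_fit` and `validation_mse` would run as Map tasks on each data source, and a Reduce task would collect the fitted models and the merged validation sample. Categorical predictors or targets would need a different model type, which this sketch does not cover.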