The views expressed here are those of the individual authors and not necessarily those of the JSM sponsors, their officers, or their staff.
Abstract Details
Activity Number: 186
Type: Contributed
Date/Time: Monday, July 30, 2012, 10:30 AM to 12:20 PM
Sponsor: Social Statistics Section
Abstract - #306533
Title: Missing Value Imputation for Predictive Models on Large and Distributed Data Sources
Author(s): Jing Shyr*+ and Sier Han and Jane Chu
Companies: IBM and IBM and IBM
Address: 233 S. Wacker Drive, Chicago, IL, 60606, United States
Keywords: missing value imputation; basic statistics; MapReduce
Abstract:
This paper proposes a MapReduce method for imputing missing predictor values before predictive models are built on large, distributed data sources. First, for each predictor with missing values, Map functions build imputation models that use only the target variable, independently on each data source and machine. During this step, validation samples are also drawn at random from every data source; a Reduce function merges them into one global validation sample and gathers the collection of imputation models. Second, another set of Map functions evaluates all imputation models against the global validation sample in a distributed manner, and the top K models are selected to form an ensemble model. Third, the ensemble model is sent to each data source to impute the missing predictor values. Finally, the completed dataset can be used to build any model for prediction, as well as for discovering and interpreting relationships between the target and a set of predictors.
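The steps in this paragraph can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: it covers a single continuous predictor with a categorical target, uses per-class means as the imputation model, scores models by mean squared error, and combines the top K models by simple averaging. The abstract does not specify the model form, the evaluation metric, or the combination rule, and all function names here are hypothetical. Evaluation also runs in one process, whereas the abstract distributes it across a second set of Map functions.

```python
import random

def map_build_model(partition, sample_rate, rng):
    """Map step (hypothetical): fit one model per data source and draw a validation sample.

    partition is a list of (target_class, x) pairs; x is None when missing.
    The model is a dict mapping each target class to the mean of observed x.
    """
    sums = {}
    for y, x in partition:
        if x is not None:
            s, n = sums.get(y, (0.0, 0))
            sums[y] = (s + x, n + 1)
    model = {y: s / n for y, (s, n) in sums.items()}
    sample = [(y, x) for y, x in partition
              if x is not None and rng.random() < sample_rate]
    return model, sample

def reduce_collect(map_outputs):
    """Reduce step: gather all models and merge the samples into one global sample."""
    models = [m for m, _ in map_outputs]
    validation = [row for _, s in map_outputs for row in s]
    return models, validation

def score(model, validation, fallback):
    """Mean squared error of a model on the global validation sample (assumed metric)."""
    errors = [(model.get(y, fallback) - x) ** 2 for y, x in validation]
    return sum(errors) / len(errors)

def build_ensemble(models, validation, k):
    """Keep the k lowest-error models and average their predictions (assumed rule).

    Assumes the validation sample is non-empty.
    """
    fallback = sum(x for _, x in validation) / len(validation)
    top = sorted(models, key=lambda m: score(m, validation, fallback))[:k]

    def predict(y):
        preds = [m[y] for m in top if y in m]
        return sum(preds) / len(preds) if preds else fallback

    return predict

def impute(partition, predict):
    """Final step on each data source: fill in missing predictor values."""
    return [(y, x if x is not None else predict(y)) for y, x in partition]

# Two toy data sources; sample_rate=1.0 keeps every observed row for validation.
rng = random.Random(0)
part1 = [("a", 1.0), ("a", 3.0), ("b", 10.0), ("a", None)]
part2 = [("a", 2.0), ("b", 12.0), ("b", None)]
outputs = [map_build_model(p, 1.0, rng) for p in (part1, part2)]
models, validation = reduce_collect(outputs)
predict = build_ensemble(models, validation, k=2)
filled1 = impute(part1, predict)
filled2 = impute(part2, predict)
```

In a real deployment each `map_build_model` call would run where its data lives, and only the small model and sample would travel over the network.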
The type of imputation model depends on whether the predictor and the target are categorical or continuous. Because each model uses only the target variable, it requires only basic statistics between the predictor and the target (such as means, variances, covariances, and counts). These can be collected in a single data pass, which matters for large, distributed data sources.
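To illustrate why basic statistics suffice, the sketch below handles the case where both predictor and target are continuous. It assumes, as one plausible choice the abstract does not confirm, that the imputation model is a simple linear regression of the predictor on the target. The class name `PairStats` and its methods are hypothetical. Each data source accumulates counts and sums in one pass, the partial results are merged by addition, and the regression coefficients follow from the merged means, variance, and covariance.

```python
from dataclasses import dataclass

@dataclass
class PairStats:
    """Running sums for a predictor x and target y, collected in one pass."""
    n: int = 0
    sum_x: float = 0.0
    sum_y: float = 0.0
    sum_xy: float = 0.0
    sum_yy: float = 0.0

    def add(self, x, y):
        """Update the sums with one row; rows with missing x are skipped."""
        if x is None:
            return
        self.n += 1
        self.sum_x += x
        self.sum_y += y
        self.sum_xy += x * y
        self.sum_yy += y * y

    def merge(self, other):
        """Combine statistics from two data sources by adding their sums."""
        return PairStats(
            self.n + other.n,
            self.sum_x + other.sum_x,
            self.sum_y + other.sum_y,
            self.sum_xy + other.sum_xy,
            self.sum_yy + other.sum_yy,
        )

    def fit(self):
        """Return (slope, intercept) for predicting x from y.

        Assumes n > 0 and that y is not constant (nonzero variance).
        """
        mean_x = self.sum_x / self.n
        mean_y = self.sum_y / self.n
        cov_xy = self.sum_xy / self.n - mean_x * mean_y
        var_y = self.sum_yy / self.n - mean_y ** 2
        slope = cov_xy / var_y
        return slope, mean_x - slope * mean_y

# Two toy data sources whose observed rows follow x = 2*y + 1.
source_a = [(3.0, 1.0), (5.0, 2.0), (None, 5.0)]
source_b = [(7.0, 3.0), (9.0, 4.0)]

stats_a, stats_b = PairStats(), PairStats()
for x, y in source_a:
    stats_a.add(x, y)
for x, y in source_b:
    stats_b.add(x, y)

slope, intercept = stats_a.merge(stats_b).fit()
imputed = slope * 5.0 + intercept  # fill the missing x in source_a
```

Raw sums of products are easy to merge but can lose precision on very large or poorly scaled data; a production version would likely use a numerically stable update for the variance and covariance.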
The address information is for the authors that have a + after their name.
Authors who are presenting talks have a * after their name.