JSM 2012 | eventScribe Itinerary Planner

186 – Advances in Missing Data Methods

Missing Value Imputation for Predictive Models on Large and Distributed Data Sources

Sponsor: Social Statistics Section

Keywords: missing value imputation, basic statistics, MapReduce

Jane Chu

IBM

Sier Han

IBM

Jing Shyr

IBM

The paper proposes a method for imputing missing predictor values ahead of predictive modeling on large and distributed data sources, using a MapReduce approach. First, for each predictor that has missing values, imputation models based only on the target variable are built independently on the different data sources and machines using Map functions. During this step, validation samples are drawn at random from every data source; a Reduce function then merges them into one global validation sample and gathers the collection of imputation models. Second, another set of Map functions evaluates all imputation models against the global validation sample in a distributed manner, and the top K models are selected to form an ensemble model. Third, the ensemble model is sent to each data source to impute the missing predictor values. Finally, the completed dataset can be used to build any model for prediction, as well as for discovering and interpreting relationships between the target and a set of predictors. The type of imputation model depends on whether the predictor and the target are categorical or continuous. Because only the target variable is used, only basic statistics between the predictor and target variables (means, variances, covariances, counts, and so on) need to be collected. This takes a single data pass, which is important for large and distributed data sources.
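
The workflow described in the abstract can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: it covers just the case of a continuous predictor x and a continuous target y, it assumes each local imputation model is a least-squares line x = a + b*y fitted from single-pass sums, it ranks models by mean squared error on the global validation sample, and it assumes the ensemble averages the predictions of the top K models. The function names (map_fit, reduce_merge, select_top_k, impute) and the val_fraction and seed parameters are hypothetical.

```python
import random
from statistics import fmean


def map_fit(rows, val_fraction=0.2, seed=0):
    """Map step (assumed form): one pass over a partition of (x, y) rows.

    x is None where the predictor is missing. Observed rows are randomly
    split into a local validation sample and a training part, and only
    basic sums are accumulated for the training part.
    """
    rng = random.Random(seed)
    n = sx = sy = sxy = syy = 0.0
    validation = []
    for x, y in rows:
        if x is None:
            continue  # rows with a missing predictor cannot train the model
        if rng.random() < val_fraction:
            validation.append((x, y))
            continue
        n += 1
        sx += x
        sy += y
        sxy += x * y
        syy += y * y
    if n < 2:
        return None, validation
    var_y = syy / n - (sy / n) ** 2
    cov_xy = sxy / n - (sx / n) * (sy / n)
    slope = cov_xy / var_y if var_y > 0 else 0.0
    intercept = sx / n - slope * (sy / n)
    return (intercept, slope), validation


def reduce_merge(map_outputs):
    """Reduce step: collect the local models and pool the validation rows."""
    models = [model for model, _ in map_outputs if model is not None]
    validation = [row for _, rows in map_outputs for row in rows]
    return models, validation


def mse(model, validation):
    """Mean squared error of one model on the global validation sample."""
    intercept, slope = model
    return fmean((x - (intercept + slope * y)) ** 2 for x, y in validation)


def select_top_k(models, validation, k):
    """Keep the k models with the lowest validation error (assumed criterion)."""
    if not validation:
        raise ValueError("global validation sample is empty")
    return sorted(models, key=lambda m: mse(m, validation))[:k]


def impute(rows, ensemble):
    """Fill missing x with the average prediction of the ensemble members."""
    completed = []
    for x, y in rows:
        if x is None:
            x = fmean(a + b * y for a, b in ensemble)
        completed.append((x, y))
    return completed
```

In a real deployment, map_fit and the model evaluation would each run as Map tasks on the machines holding the data, and only the small tuples of coefficients and the validation rows would travel over the network. A categorical predictor or target would need a different local model, for example one built from per-category counts, which this sketch does not attempt.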
