All Times EDT

Abstract Details

Activity Number: 149 - Statistical Learning for Decision Support
Type: Contributed
Date/Time: Monday, August 8, 2022 : 10:30 AM to 12:20 PM
Sponsor: Section on Statistical Learning and Data Science
Abstract #323527
Title: How Real Is Synthetic Missing Data? Impact of Missing Pattern Modeling on Imputer Evaluation
Author(s): Rohan Chakraborty* and Ambar Kleinbort and Janelle Szary and Anne Thissen-Roe
Companies: pymetrics and pymetrics and pymetrics and pymetrics
Keywords: imputation; synthetic missing data; missing data; feature engineering; machine learning; MNAR
Abstract:

Synthesizing datasets with artificially missing data, in order to test missing-data handling methods such as imputation, has recently been made easier by open-source code libraries (e.g., as described in Muzellec et al., 2020). These libraries implement data masking aligned to the classic mechanisms of missingness (MCAR, MAR, MNAR), where data points may be missing as a function of other observed data points. In our operational machine learning system, however, imputation occurs on engineered features, while missingness originates in the raw data. We believe this case to be common in applied ML. Here, we simulate missingness using bespoke code that mimics our data-generating process, and compare the results to publicly available methods of data masking across several imputer types. We find that the ordering of imputer types by performance is generally robust to the masking mechanism; that sensitivity to the percentage of missing data varies, particularly among MAR and MNAR mechanisms; and that simple masking can underestimate the magnitude of error when more complex MAR and MNAR missingness exists in the real data.
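For readers unfamiliar with synthetic masking, the sketch below illustrates the general idea of masking a complete dataset under the three classic mechanisms and scoring an imputer on the masked entries. It is not the authors' bespoke code and does not reproduce the API of the libraries cited above. The function names, the sigmoid-based masking rules, the mean imputer, and the simulated Gaussian data are all assumptions chosen for illustration.

```python
import numpy as np


def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def _standardize(X):
    # Column-wise z-scores, so masking probabilities are scale-free.
    return (X - X.mean(axis=0)) / X.std(axis=0)


def mask_mcar(X, p, rng):
    """MCAR: each entry is masked independently with probability p."""
    return rng.random(X.shape) < p


def mask_mar(X, p, rng):
    """MAR (illustrative rule): masking in columns 1.. depends only on
    column 0, which is never masked. Assumes p <= 0.5."""
    z0 = _standardize(X)[:, [0]]  # shape (n, 1)
    prob = np.broadcast_to(2.0 * p * _sigmoid(z0), X.shape).copy()
    prob[:, 0] = 0.0  # column 0 stays fully observed
    return rng.random(X.shape) < prob


def mask_mnar(X, p, rng):
    """MNAR (illustrative self-masking rule): larger values are more
    likely to be masked. Assumes p <= 0.5."""
    prob = 2.0 * p * _sigmoid(_standardize(X))
    return rng.random(X.shape) < prob


def mean_impute(X, mask):
    """Fill masked entries with the mean of the observed entries per column."""
    X_obs = np.where(mask, np.nan, X)
    col_means = np.nanmean(X_obs, axis=0)
    return np.where(mask, col_means, X)


def rmse_on_masked(X, X_imputed, mask):
    """Root mean squared error, computed on the masked entries only."""
    return float(np.sqrt(np.mean((X[mask] - X_imputed[mask]) ** 2)))


# Hypothetical complete data: three correlated, unit-variance Gaussian features.
rng = np.random.default_rng(0)
cov = np.full((3, 3), 0.6) + 0.4 * np.eye(3)
X = rng.multivariate_normal(np.zeros(3), cov, size=5000)

maskers = {"MCAR": mask_mcar, "MAR": mask_mar, "MNAR": mask_mnar}
masks, results = {}, {}
for name, make_mask in maskers.items():
    masks[name] = make_mask(X, 0.3, rng)
    X_imputed = mean_impute(X, masks[name])
    results[name] = rmse_on_masked(X, X_imputed, masks[name])
    print(f"{name}: RMSE on masked entries = {results[name]:.3f}")
```

In this toy setup the mask is applied directly to the features being imputed. The abstract's point is that in an operational pipeline the missingness arises upstream in the raw data, so a faithful simulation would mask the raw inputs and then recompute the engineered features.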


Authors who are presenting talks have a * after their name.

Back to the full JSM 2022 program