Abstract: |
  Recently, the synthesis of datasets with artificially missing data, for the purpose of testing missing-data handling methods such as imputation, has been made easier by the availability of open-source code libraries (e.g., as described in Muzellec et al., 2020). These libraries implement data masking aligned with the classic mechanisms of missingness (MCAR, MAR, MNAR), in which the probability that a value is missing may depend on observed values, on the unobserved values themselves, or on neither. However, in our operational machine learning system, imputation occurs on engineered features, while missingness originates in the raw data. We believe this case to be common in applied ML. Here, we simulate data missingness using bespoke code that mimics our data-generating process and compare the results to publicly available data-masking methods, in the context of a comparison of different imputer types. We find that the ordering of imputer types by performance is generally robust to the masking mechanism, that sensitivity to the percentage of missing data varies, particularly among MAR and MNAR mechanisms, and that simple masking can underestimate the magnitude of error when more complex MAR and MNAR missingness exists in the real data.