Keywords: Missing data, Neural Networks, Cardiovascular Disease, UKBiobank, Imputation
Missing data is a common problem in survey datasets. Here we compare approaches to handling missingness in UKBiobank surveys and physical measurements from 500,000 participants when predicting cardiovascular disease (CVD). The need to integrate imaging and time-series data motivates the use of neural networks, which can be trained jointly across all data modalities and disease types, such as atrial fibrillation, coronary artery disease and myocardial infarction. Missingness is complex: survey answers can be absent, refused, unknown, censored, or truncated, and devices can emit errors or impossible values. Out of thousands of variables, even features found to have above-average importance for CVD prediction can have up to 95% missing values. This occurs for variables such as “Age diabetes diagnosed”, which demonstrates the need for careful treatment of missingness. Simplistic solutions such as listwise deletion would both remove valuable signal and make training infeasible due to a lack of data.
Here we evaluate several missing data techniques, including modeling the missingness explicitly and various imputation methods. Many missing values can reasonably be assumed to be Missing at Random given the many demographic covariates provided by UKBiobank. Baseline models with missingness modeled explicitly were tuned with feature selection using Random Forests and with Bayesian hyperparameter optimization. Once reasonable models were selected, we compared three approaches: modeling the missingness as separate channels in each input tensor, imputation, and a combination of both techniques. Models that accounted for missingness explicitly performed 4% worse at predicting CVD incidence than those that only imputed missing values. However, when predicting CVD prevalence, including a channel for missingness did not produce a significant difference in area under the ROC curve. As CVD incidence is of clinical importance, this demonstrates the benefit of imputation for neural networks.
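The three input encodings compared above can be illustrated with a minimal NumPy sketch. It is not the pipeline used in this work: column-mean imputation stands in for the unspecified imputation methods, missing entries are assumed to be encoded as NaN, and all function names are hypothetical.

```python
import numpy as np

def impute_mean(x):
    """Replace each NaN with the mean of the observed values in its column.

    Assumes every column has at least one observed value; an all-NaN
    column would stay NaN.
    """
    col_mean = np.nanmean(x, axis=0)
    return np.where(np.isnan(x), col_mean, x)

def with_missingness_channel(x, fill_value=0.0):
    """Model missingness explicitly: channel 0 holds the values (with a
    constant placeholder where missing), channel 1 holds a 0/1 indicator."""
    mask = np.isnan(x)
    values = np.where(mask, fill_value, x)
    return np.stack([values, mask.astype(x.dtype)], axis=-1)

def impute_with_channel(x):
    """Combine both techniques: imputed values plus the indicator channel."""
    mask = np.isnan(x).astype(x.dtype)
    return np.stack([impute_mean(x), mask], axis=-1)

# Toy feature matrix: 3 participants, 2 variables, NaN marks a missing entry.
x = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0]])

imputed = impute_mean(x)                # shape (3, 2)
channels = with_missingness_channel(x)  # shape (3, 2, 2)
combined = impute_with_channel(x)       # shape (3, 2, 2)
```

In the two channel-based encodings the network sees which entries were missing, whereas pure imputation discards that information before training.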