Online Program

Thursday, May 30
Data Science Technologies
Practice and Applications
Data Science Applications E-Posters, II
Thu, May 30, 5:30 PM - 6:30 PM
Grand Ballroom Foyer

Handling Missing Data in Cardiovascular Disease Prediction Using Neural Networks (306357)


Puneet Batra, Broad Institute 
Samuel Friedman, Broad Institute 
*Megan Shand, Broad Institute 
Kaan Yuksel, Broad Institute 

Keywords: Missing data, Neural Networks, Cardiovascular Disease, UKBiobank, Imputation

Missing data is a common problem in survey datasets. Here we compare solutions for missingness in UKBiobank surveys and physical measurements from 500,000 participants when predicting cardiovascular disease (CVD). The need to integrate imaging and time-series data motivates the use of neural networks, which can be trained jointly across all data modalities and disease types, such as atrial fibrillation, coronary artery disease, and myocardial infarction. Missingness is complex: survey answers can be absent, refused, unknown, censored, or truncated, and devices can emit errors or impossible values. Out of thousands of variables, even features found to have above-average importance for CVD prediction can have up to 95% missing values. This occurs for variables such as “Age diabetes diagnosed”, which demonstrates the need for careful consideration of the missingness. Simplistic solutions such as listwise deletion would both remove valuable signal and make training infeasible due to a lack of data.
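
The cost of listwise deletion described above can be illustrated with a minimal sketch. The records, feature names, and values below are hypothetical toy data, not UKBiobank fields or results; the helper functions are illustrative only.

```python
# Hypothetical toy records; None marks a missing answer or measurement.
# These values are invented for illustration and are not UKBiobank data.
records = [
    {"age": 61, "bmi": 27.4, "age_diabetes_diagnosed": None},
    {"age": 54, "bmi": None, "age_diabetes_diagnosed": None},
    {"age": 70, "bmi": 31.0, "age_diabetes_diagnosed": 58},
    {"age": 48, "bmi": 24.9, "age_diabetes_diagnosed": None},
]


def missing_fraction(rows):
    """Return, for each feature, the fraction of rows where it is missing."""
    features = rows[0].keys()
    return {f: sum(r[f] is None for r in rows) / len(rows) for f in features}


def listwise_delete(rows):
    """Keep only rows in which every feature is observed (complete cases)."""
    return [r for r in rows if all(v is not None for v in r.values())]


fractions = missing_fraction(records)
complete = listwise_delete(records)

print(fractions)
print(f"{len(complete)} of {len(records)} rows survive listwise deletion")
```

In this toy case a single mostly-missing feature is enough to discard most rows, which is the effect that makes listwise deletion impractical when many such features are present.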

Here we evaluate several missing data techniques, including modeling the missingness explicitly and various imputation methods. Many missing values can reasonably be assumed to be Missing at Random given the many demographic covariates provided by UKBiobank. Baseline models with missingness modeled explicitly were tuned with feature selection using random forests and Bayesian hyperparameter optimization. Once reasonable models were selected, a comparison was made between modeling the missingness as separate channels in each input tensor, imputation, and a combination of both techniques. Models that accounted for missingness explicitly performed 4% worse at predicting CVD incidence than those that only imputed missing values. However, when predicting CVD prevalence, including a channel for missingness did not produce a significant difference in the area under the ROC curve. As CVD incidence is of clinical importance, this demonstrates the benefit of imputation for neural networks.
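
The three input encodings compared above (a missingness channel, imputation, and both) can be sketched as follows. This is a minimal illustration under stated assumptions: mean imputation stands in for the unspecified imputation methods, the zero fill value for the channel-only variant is an arbitrary choice, and the function names and toy values are hypothetical rather than taken from the authors' code.

```python
from statistics import mean


def column_means(rows):
    """Mean of the observed values in each column (0.0 if none observed)."""
    means = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(mean(observed) if observed else 0.0)
    return means


def encode(rows, means=None, add_channel=True):
    """Encode each feature as [value] or [value, is_missing].

    Missing values are filled with the column mean when `means` is given
    (mean imputation, an assumption of this sketch), otherwise with 0.0.
    """
    encoded = []
    for r in rows:
        out = []
        for j, v in enumerate(r):
            fill = means[j] if means is not None else 0.0
            value = fill if v is None else float(v)
            flag = 1.0 if v is None else 0.0
            out.append([value, flag] if add_channel else [value])
        encoded.append(out)
    return encoded


# Hypothetical toy inputs: two features per participant, None = missing.
rows = [[61.0, 27.0], [54.0, None], [70.0, 31.0]]
means = column_means(rows)

channel_only = encode(rows)                            # missingness channel
imputed_only = encode(rows, means, add_channel=False)  # imputation
both = encode(rows, means)                             # combination

print(channel_only[1], imputed_only[1], both[1])
```

In practice the imputation statistics would be computed on the training split only and reused for validation and test data, so that no information leaks across splits.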