Abstract:
|
Advances in data collection and storage technology produce large volumes of data for clinical and basic science research, such as electronic health/medical records with hundreds or more variables. Machine-learning methods, which are commonly used for data imputation, are promising for handling complicated correlations in big data. However, their statistical properties, particularly those of deep learning, are not well studied. A practical guide to applying machine-learning methods in missing data analysis is urgently needed. We therefore design a comprehensive simulation study of missing data analysis to evaluate the performance of classical statistical methods, a high-dimensional model, classical machine-learning methods, and deep learning. In the simulation, we consider both low- and high-dimensional data and both linear and non-linear correlations among variables. We compare the imputation bias and variance of the different methods. Our study will provide guidance for investigators wishing to use machine-learning methods for data imputation, and will encourage further applied and theoretical work on machine-learning-based imputation.
|