Abstract:
|
Although probability samples are regarded as the gold standard for collecting information in population-based studies, non-probability samples are frequently used in practice because of their low cost, their convenience, and the difficulty of constructing sampling frames. Naïve estimates based on non-probability samples without any adjustment may be misleading because of selection bias. Recently, data integration approaches, including mass imputation, propensity score weighting, and calibration, have been used to improve the representativeness of non-probability samples. However, the effectiveness of the mass imputation approach depends on the underlying model assumptions. In this paper, we propose and compare several modern machine learning (ML) based mass imputation approaches, including generalized additive modeling (GAM), regression trees, random forests, XGBoost, support vector machines, and deep learning. We evaluate the proposed methods in terms of relative bias, relative standard error, and relative root mean squared error, using both a simulation study and a real-data application. The ML-based methods outperformed GAM when the data contained non-linear relationships.
|