Abstract:
|
Nonprobability samples arise frequently in practice, including in medicine, epidemiology, public opinion research, and other fields. It is well known that naïve estimates based on nonprobability samples may suffer from selection bias. Data integration, which combines information from nonprobability samples and probability samples, has been shown to be an effective way to handle nonprobability samples. However, the validity of data integration approaches depends on the underlying model assumptions. Modern machine learning approaches, including generalized additive models, random forests, XGBoost, and deep learning, have been shown to be somewhat robust against violations of those model assumptions. In this paper, we compare different machine learning based data integration approaches via a simulation study and a real data application. The XGBoost and deep learning approaches are shown to outperform the other machine learning approaches in terms of balancing bias and variance.
|