Abstract:
|
Massive datasets make Bayesian posterior sampling computationally expensive, since most Markov chain Monte Carlo (MCMC) methods require at least O(N) operations to draw a single sample, where N is the (large) number of observations. In this paper, we present a new posterior sampling method for big-data applications that randomly partitions the dataset into subsets and runs independent, parallel MCMC chains on each subset using different processors. Each processor draws samples from a predefined distribution given its subset of the data, and the samples from all processors are then combined via importance resampling to produce approximate full-data posterior samples. We apply our method to the Bayesian Additive Regression Trees (BART) model and observe better performance than an alternative method, Consensus Monte Carlo, in terms of both the quality of the posterior approximation and run time. Furthermore, we introduce a modification of our method for BART that significantly improves posterior sampling and, unlike Consensus Monte Carlo, generates posterior distributions that are indistinguishable from the full-data posterior.
|