Abstract:
|
High-dimensional data are routinely collected in many application areas. In this article, we are particularly interested in classification models where the predictors are imbalanced. This creates well-known difficulties in estimation. To tackle this, Bayesian approaches with appropriate priors are often used, with Markov chain Monte Carlo (MCMC) algorithms used for posterior computation. However, current MCMC algorithms can be inefficient as the size of the dataset increases due to worsening time per step. One promising strategy is to use a gradient-based sampler while relying on data subsamples to reduce the computational complexity per step. However, usual subsampling strategies break down when applied to imbalanced data. Instead, we propose to generalize recent piece-wise deterministic MCMC algorithms to include stratified and importance-weighted subsampling, and also propose a new subsampling algorithm based on sorting data points. These approaches maintain the correct stationary distribution with arbitrarily small subsamples and substantially outperforms current competitors. We provide theoretical support and illustrate gains in multiple simulated and real data applications.
|