Bayesian cluster analysis aims to infer the number of data clusters present in a data set using either finite or infinite mixture models. In Bayesian finite mixture models, a one-to-one relationship between components and data clusters is usually assumed. The number of components can be determined by comparing the marginal likelihoods of the candidate models or by approximating the posterior of the number of components with methods such as reversible jump MCMC, Markov birth-and-death process sampling, or the Jain-Neal split-merge sampler.
We propose to distinguish explicitly between the number of data clusters and the number of components, and to purposely allow for more components than data clusters. We extend the standard approach by including priors on the number of components and on the Dirichlet parameter. This allows us to approximate the posteriors of both the number of components and the number of data clusters using Gibbs sampling techniques. The performance of the proposed sampling technique is compared to that of previously proposed approaches. The additional flexibility gained by suitably selecting the parameters of the hyperpriors is highlighted, and guidance for their choice is provided.
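The distinction between components and data clusters can be illustrated with a small simulation. The sketch below is not the proposed sampler; it only draws from a prior, assuming a symmetric Dirichlet prior on the mixture weights, and counts how many of the K components actually receive observations. The function name `count_filled_components` and the values of `n`, `K`, and `gamma` are hypothetical choices made for illustration.

```python
import numpy as np

def count_filled_components(n, K, gamma, rng):
    """Return the number of components with at least one observation."""
    # Assumption: symmetric Dirichlet(gamma, ..., gamma) prior on the K weights.
    weights = rng.dirichlet(np.full(K, gamma))
    # Allocate n observations to the K components according to the weights.
    counts = rng.multinomial(n, weights)
    # Components with a nonzero count play the role of data clusters.
    return int(np.count_nonzero(counts))

rng = np.random.default_rng(0)
K = 10  # number of components (illustrative value)
draws = [count_filled_components(100, K, 0.1, rng) for _ in range(1000)]
print("components K:", K)
print("mean number of filled components:", np.mean(draws))
```

Under these assumptions, the number of filled components can never exceed K, and with a small Dirichlet parameter some components typically remain empty, which is the sense in which a model may carry more components than data clusters.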