Abstract:
Modern big data analytics often involves high-dimensional count data sets, such as bag-of-words data, ecological survey data, and biological sequence data. A main goal in the analysis of such data is to uncover the group structure in the samples and to identify the discriminating features. We propose a Bayesian nonparametric hierarchical Poisson mixture model that accounts for the overdispersion observed across samples as well as across multiple features. Our model formulation incorporates a feature selection mechanism and prior distributions that appropriately account for identifiability constraints on the model parameters. Our strategy for posterior inference yields a unified approach to clustering the samples and identifying the discriminating features. We demonstrate the performance of our method on simulated data and present applications to document clustering, based on a bag-of-words benchmark data set, and to document classification, with an analysis of the Federalist Papers.
Copyright © American Statistical Association.