Abstract:
Large-scale discrete data arise in a wide range of scientific areas, such as text mining, genetics, and neuroscience. Motivated by the need to separate latent source signals from such discrete data, we propose a general Bayesian framework that extends existing independent component analysis (ICA) approaches to various types of discrete outcomes, including binary, multi-level, and ordinal data. In our model, the discrete nature of the data is captured by performing source signal separation on its distribution. This can essentially be viewed as decomposing the discrete data in its associated probability vector space. The proposed model enjoys good theoretical properties, and efficient MCMC methods are developed for posterior computation. For high-dimensional problems, we develop a fast EM algorithm to search for the maximum a posteriori (MAP) estimate. We demonstrate the performance of the proposed approach via extensive simulation studies. We apply our methods to analyze two massive real discrete data sets: the MNIST database of handwritten digits and the SNP data in a genome-wide association study (GWAS) of Alzheimer's disease.
ASA Meetings Department
732 North Washington Street, Alexandria, VA 22314
(703) 684-1221 • meetings@amstat.org
Copyright © American Statistical Association.