Abstract:
|
High-throughput sequencing technology provides unprecedented opportunities to quantitatively explore the human gut microbiome and its relation to disease. Microbiome data are compositional, sparse, noisy, and heterogeneous, properties that pose serious challenges for statistical modeling. We propose a Bayesian multinomial matrix factorization model to infer overlapping clusters of both microbes and human hosts. The proposed method represents the observed over-dispersed, zero-inflated count matrix as Dirichlet-multinomial mixtures, on which the latent cluster structures are built hierarchically. Under the Bayesian framework, the number of clusters is determined automatically, and available information from a taxonomic rank tree of the microbes is incorporated, which greatly improves the interpretability of the findings. We demonstrate the utility of the proposed approach using simulations and an application to a human inflammatory bowel disease microbiome dataset. The application reveals interesting clusters, some of which contain bacteria that the existing biological literature reports to be related to the disease.
|