Abstract:
|
The coevolution of humans and the bacteria that colonize the human body has profound implications for health and development. Dimension reduction tools are routinely applied to multivariate abundance data to visualize broad trends in how similar or different microbial communities are. Yet the analysis of microbiome data is complicated by several statistical challenges. In particular, microbiome data produced by high-throughput sequencing are count-valued, correlated, high-dimensional, over-dispersed with excess zeros, and compositional. To overcome these challenges, we introduce a general framework called Zero-Inflated Probabilistic PCA (ZIPPCA), which extends probabilistic PCA from the Gaussian setting to multivariate abundance data. We propose empirical Bayes approaches for ordination of microbiome data under a negative binomial ZIPPCA model and for inferring microbial compositions under a logistic normal multinomial ZIPPCA model. We develop efficient variational approximation algorithms for estimation, inference, and prediction. We demonstrate the performance of the proposed methods on two real microbiome data sets.
|