Abstract:
|
Statistical inference on taxa ("species") observed in large-scale 16S metagenomic surveys is of considerable biomedical interest. However, the large number of species discovered and the excess zeroes in their count distributions make statistical analysis challenging. In this work, we mitigate these issues by aggregating the counts of carefully chosen species that respond similarly to latent ecological factors and environmental processes. It is well known that the relative abundances of such ecologically equivalent or nearly equivalent species are not necessarily influenced by changes in environmental conditions across local and regional scales, whereas their summed total abundance is (Hubbell, 2001; Leibold & McPeek, 2006). We construct a Bayesian nonparametric model and two posterior inference algorithms, based on Gibbs sampling and collapsed variational inference, that ultimately yield a reduced dataset with these taxa clusters as the new units of analysis (termed "Equivalence Class Units"). Such summaries have better-behaved distributional characteristics, allowing classical statistical procedures to be applied.
|