Abstract:
|
Microbiome data record the relative abundances of microbial entities, called operational taxonomic units (OTUs), that are present in various environments. Analyzing such data is challenging for three main reasons: (i) they are compositional, i.e., the absolute abundances are not known; (ii) they are high-dimensional, i.e., the number of microbes is large; and (iii) they are highly sparse, i.e., most microbes are present in only a few samples. Many existing methods carefully address the first two challenges but take an ad hoc approach to data sparsity. For example, authors sometimes manually aggregate OTUs to the genus or family level and/or simply filter out any microbial entities that are rare. We instead propose a principled regression framework that addresses all three challenges. In particular, our method uses phylogenetic information to automate the aggregation process in a data-driven manner. We show that our approach leads to superior performance relative to existing methods on microbiome data.
|