Abstract:
|
Modern compositional count data commonly exhibit (i) high dimensionality, with a large number of compositional categories; (ii) sparsity, with only a small fraction of those categories present in any single sample; and (iii) large and heterogeneous cross-sample variability. We introduce a class of probability models based on the Dirichlet-tree (DT) that aims to incorporate these features while striking a balance between two classical model classes for compositional data: the Dirichlet model, which is simple but restrictive, and log-ratio-based models, which are flexible but can make inference difficult and can overfit without additional care. We show that the DT, which in its simplest form decomposes a multinomial into multiple binomial models organized over a dyadic tree, can be adapted for a variety of inference purposes, including cross-sample comparison, regression analysis, and latent structure learning. We illustrate the model in the context of microbiome analysis, where a natural tree structure is available from the evolutionary relationships among microbial species.
|
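The decomposition mentioned in the abstract, a multinomial rewritten as binomials over a dyadic tree, can be illustrated with a minimal sketch. It covers only the sampling model, not the priors on the branching probabilities that the Dirichlet-tree adds. The four-category tree, the counts, the probabilities, and all function names below are hypothetical and chosen for illustration; they are not taken from the paper.

```python
from math import comb, factorial, prod, isclose

# Hypothetical dyadic tree over four categories (an assumption, not from the paper).
# A leaf is an integer category index; an internal node is a (left, right) pair.
TREE = ((0, 1), (2, 3))

def leaves(node):
    """Return the category indices under a node."""
    if isinstance(node, int):
        return [node]
    left, right = node
    return leaves(left) + leaves(right)

def binomial_splits(node, counts, probs):
    """Return (n_total, n_left, p_left) for every internal node of the tree."""
    if isinstance(node, int):
        return []
    left, right = node
    n_left = sum(counts[i] for i in leaves(left))
    n_right = sum(counts[i] for i in leaves(right))
    p_left = sum(probs[i] for i in leaves(left))
    p_right = sum(probs[i] for i in leaves(right))
    # Conditional probability of going left, given that a count reached this node.
    split = (n_left + n_right, n_left, p_left / (p_left + p_right))
    return ([split]
            + binomial_splits(left, counts, probs)
            + binomial_splits(right, counts, probs))

def binomial_pmf(n, k, p):
    """Binomial probability of k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def multinomial_pmf(counts, probs):
    """Multinomial probability of the full count vector."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)
    return coef * prod(p**c for p, c in zip(probs, counts))

counts = [3, 1, 0, 2]          # illustrative sparse sample
probs = [0.4, 0.1, 0.2, 0.3]   # illustrative category probabilities

splits = binomial_splits(TREE, counts, probs)
tree_pmf = prod(binomial_pmf(n, k, p) for n, k, p in splits)

# The product of node-level binomials matches the multinomial probability.
print(isclose(tree_pmf, multinomial_pmf(counts, probs)))
```

Each internal node contributes one binomial: of the counts that reach the node, how many go to the left child. Because the likelihood factorizes node by node, a model of this kind can treat each split separately, which is one way to read the flexibility the abstract attributes to the tree structure.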