Abstract:
|
In microbiome studies, taxa composition is often estimated from sequencing read counts in order to account for the large variability in the total number of observed reads across samples. Because of limited sequencing depth, some rare microbial taxa might not be captured in metagenomic sequencing, which results in many zero read counts. Naive composition estimation using count normalization therefore leads to many zero proportions, which underestimates the underlying compositions, especially for the rare taxa. In this paper, the observed counts are assumed to be sampled from a multinomial distribution, with the unknown composition being the probability parameter in a high-dimensional positive simplex. Under the assumption that the composition matrix is approximately low rank, a nuclear norm regularization-based likelihood estimation is developed to estimate the underlying compositions of the samples. Theoretical upper bounds and minimax lower bounds on the estimation errors, measured by the Kullback-Leibler divergence and the Frobenius norm, are established. Simulations and real data analysis will be presented.
|