Keywords: microbiome, statistical learning, machine learning, batch effects, sequencing
Estimating the composition of a microbiome is important given the critical role that microbiomes play in human and environmental health. However, profiling a microbial community with high-throughput sequencing yields a distorted picture of its true composition. Sequencing mock communities (artificially constructed microbiomes of known composition) clearly illustrates that observed composition is a biased estimate of true composition, with certain taxa consistently over-observed or under-observed relative to their true relative abundance. We propose a statistical learning model for bias in compositional data and illustrate its performance on data from the Vaginal Microbiome Consortium. We show how our model can be used to correct for batch-specific biases, permitting meta-analysis of microbiome studies.