Abstract:
|
Next-generation amplicon sequencing is a powerful tool for investigating microbial communities. A central challenge is distinguishing true biological variants from errors introduced by PCR and sequencing. In the traditional analysis pipeline, such errors are eliminated by clustering reads within a sequence similarity threshold, usually 97%, to construct operational taxonomic units (OTUs). However, this arbitrary threshold can lead to low resolution and high false-positive rates. Here, we introduce AmpliCI, a reference-free, model-based method for rapidly resolving the number, abundance, and identity of error-free sequences in massive Illumina amplicon datasets. AmpliCI takes quality information into account and allows the data, rather than an arbitrary threshold or an external database, to drive the conclusions. AmpliCI estimates a mixture model, using a greedy strategy to gradually select error-free sequences while approximately maximizing the likelihood. We show that AmpliCI is more accurate than three other popular denoising methods, with acceptable computation time and memory usage.
|