Abstract:
|
Sequencing homologous genomic regions from samples of highly variable organisms or mixtures of organisms is becoming routine biological practice. The methodology is used to monitor the health or status of living systems, from infected individuals to fermentation vats. Although the throughput of modern sequencing technology has made such data attainable, the technology also introduces an excess of errors, making it difficult to distinguish true biological variation from technical artifacts. In many applications, identifying minor biological variation is critical. Minor variants often signal important biological shifts, such as mounting resistance, imminent community collapse, disease onset, and other conditions that warrant monitoring in the first place. We describe methods to cluster next-generation sequencing (NGS) reads while accounting for the error properties of the NGS machine. The approach relies on clustering methods for big data and leads to better separation of true variation from errors than existing methods. We demonstrate the approach on mock bacterial communities and on samples taken from HIV-infected patients.
|