Abstract:
|
The age-old wisdom "garbage in, garbage out" applies to any analysis of next-generation sequencing (NGS) data. Pipeline components are typically chained serially, with little uncertainty or other information passed from one step to the next. For example, base callers rarely use information about the underlying genome sequence, and error correction methods seldom use the error properties of sequencing. We demonstrate that integrated, probabilistic approaches that combine pipeline steps perform better than sequential analysis. Others have improved pipeline operations by borrowing information from alignment to one or more known reference genomes. Our combined approach also capitalizes on genome information, but does so without a known reference genome, to avoid biasing against the unknown. We use a hidden Markov model on a sparse de Bruijn graph, in which the transitions model genetic content and the emissions model the observable data. The combined probabilistic approach removes more errors and transmits information through the pipeline more accurately.
|