Abstract:
|
Large-scale cancer genome projects have sequenced tens of thousands of tumor genomes. These projects have detected millions of somatic variants in the coding sequence of the cancer genome, with a preponderance of rare variants that appear only once or twice. In practice, sequencing a new patient’s tumor very often reveals a variant that has not been seen before. In this study, we focus on two quantitative goals: estimating, for a new tumor, the probability of observing a variant that has never been observed before, and estimating the total number of variants that have not yet been observed. We draw on statistical methodology developed in other fields, notably species estimation in ecology and word-frequency estimation in computational linguistics. These methods are applied to the TCGA dataset, which encompasses whole-exome sequencing of 10,000 tumor genomes, and validated on a clinical cohort of 10,000 tumors sequenced with a targeted cancer gene panel. Some of the major findings from this study will be discussed.
|