The falling cost of sequencing a single genome has driven a massive increase in the volume of sequencing data generated. Recent work has focused on building software frameworks that scale population and statistical genetics analyses to increasingly large cohorts of finalized genotypes. However, large cohorts of next-generation sequencing (NGS) data also provide new opportunities to improve the quality of insights derived directly from read data, such as point and structural variant calls and RNA expression levels.
In this talk, we will discuss ADAM and the Big Data Genomics project, a set of tools for performing genomic analyses on the popular Apache Spark framework. Apache Spark distributes computational tasks across a large cluster and is well suited to running on cloud infrastructure. We will focus on how ADAM's programming abstractions enable cohort-scale machine learning on thousands of samples of genomic read data, and we will also look at how ADAM can be used to integrate read-derived likelihoods jointly across large cohorts.