Abstract Details

Activity Number: 288 - Genomical Is the New Astronomical: Big Data Algorithms and Applications in Genomics
Type: Topic Contributed
Date/Time: Tuesday, July 31, 2018 : 8:30 AM to 10:20 AM
Sponsor: Section on Statistical Computing
Abstract #329627
Title: Analyzing Large Scale Genomics Data with Apache Spark and ADAM
Author(s): Frank Nothaft*
Companies: Databricks
Keywords: Distributed computing; Genomics; Apache Spark; Cloud computing; Machine learning
Abstract:

The decreased cost of sequencing a single genome has driven a massive increase in the amount of sequencing data generated. While there has been recent interest in building software frameworks that can scale to run population/statistical genetics analyses on increasingly large cohorts of finalized genotypes, expansive cohorts of NGS data also provide novel opportunities to improve the quality of insights derived directly from read data, such as point and structural variant calls, and RNA expression levels.

In this talk, we will discuss ADAM and the Big Data Genomics project, a set of tools for performing genomics analyses using the popular Apache Spark framework. Apache Spark is a framework for running computational tasks that are distributed across a large cluster, and it is well suited to cloud computing environments. We will focus on how ADAM's programming abstractions enable cohort-scale machine learning on thousands of samples of genomic read data, and we will also look at how ADAM can be used to jointly integrate read-derived likelihoods across large cohorts.


Authors who are presenting talks have a * after their name.

Back to the full JSM 2018 program