
Abstract Details

Activity Number: 434
Type: Contributed
Date/Time: Tuesday, August 2, 2016 : 2:00 PM to 3:50 PM
Sponsor: Section on Statistics in Genomics and Genetics
Abstract #320462
Title: Model-Based Clustering and Visualization of RNA-Seq Data
Author(s): Kushal Dey* and Matthew Stephens
Companies: The University of Chicago and The University of Chicago
Keywords: RNA-seq ; clustering ; topic model ; visualization ; batch effects ; variable selection
Abstract:

RNA sequencing, of both bulk samples and, more recently, single cells, has become the method of choice for measuring gene expression. The data from these assays are often summarized as counts of the number of reads mapping to each gene. One common analysis step is to cluster the samples, usually with a hierarchical clustering method. Here we explore an alternative: a model-based clustering method, latent Dirichlet allocation, previously developed for natural language processing (Blei, Ng and Jordan 2003), that takes account of the count nature of the data. This model, like the admixture model in population genetics (Pritchard, Stephens and Donnelly 2000), allows each sample to belong to more than one cluster. We suggest different ways to visualize the results, and implement methods to identify genes whose expression characterizes each cluster. We illustrate the performance of the method by applying it to both the Genotype-Tissue Expression (GTEx) Project bulk RNA-seq data and to single-cell RNA-seq datasets. We also discuss the importance of dealing with batch effects in such data. Building on the maptpx package (Taddy 2014), our methods are implemented in the R package CountClust.
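
To make the idea of partial cluster membership concrete, the following is a minimal sketch, not the CountClust or maptpx implementation. It assumes a simplified version of the model in which the read counts of each sample are multinomial with gene probabilities given by a mixture of K cluster profiles, and it fits that model by plain maximum-likelihood EM rather than by the estimation procedure the package uses. The function name fit_gom, the names omega and theta, and the toy count matrix are all illustrative assumptions.

```python
import math
import random


def fit_gom(counts, k, n_iter=200, seed=0):
    """Fit a k-cluster grade-of-membership model to a samples x genes
    count matrix by plain EM (illustrative sketch only).

    Returns (omega, theta, loglik_trace), where omega[i][t] is the
    membership of sample i in cluster t and theta[t][j] is the relative
    expression of gene j in cluster t.
    Assumes every sample has at least one read.
    """
    rng = random.Random(seed)
    n, g = len(counts), len(counts[0])

    def normalize(v):
        s = sum(v)
        if s == 0:  # empty cluster: fall back to a uniform row
            return [1.0 / len(v)] * len(v)
        return [x / s for x in v]

    # Random positive starting values, each row normalized to sum to 1.
    omega = [normalize([rng.random() + 0.1 for _ in range(k)]) for _ in range(n)]
    theta = [normalize([rng.random() + 0.1 for _ in range(g)]) for _ in range(k)]
    loglik_trace = []

    for _ in range(n_iter):
        new_omega = [[0.0] * k for _ in range(n)]
        new_theta = [[0.0] * g for _ in range(k)]
        loglik = 0.0
        for i in range(n):
            for j in range(g):
                c = counts[i][j]
                if c == 0:
                    continue
                # Probability that a read from sample i maps to gene j.
                parts = [omega[i][t] * theta[t][j] for t in range(k)]
                p = sum(parts)
                loglik += c * math.log(p)
                for t in range(k):
                    # Expected number of the c reads attributed to cluster t.
                    share = c * parts[t] / p
                    new_omega[i][t] += share
                    new_theta[t][j] += share
        # Log-likelihood (up to a constant) of the parameters before this update.
        loglik_trace.append(loglik)
        omega = [normalize(row) for row in new_omega]
        theta = [normalize(row) for row in new_theta]
    return omega, theta, loglik_trace


# Toy data (made up): 4 samples x 4 genes. The last sample is built to
# look like a mixture of the expression patterns of the others.
counts = [
    [50, 40, 2, 1],
    [45, 55, 1, 3],
    [2, 1, 60, 50],
    [25, 22, 30, 28],
]
omega, theta, trace = fit_gom(counts, k=2)
for i, row in enumerate(omega):
    print("sample", i, [round(x, 3) for x in row])
```

Each row of omega holds the membership proportions of one sample and sums to 1, so a sample is described by its proportions across clusters rather than by a single cluster label. Cluster labels are arbitrary and may swap between random starts.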


Authors who are presenting talks have a * after their name.

Back to the full JSM 2016 program

 
 