Abstract:
|
RNA sequencing of bulk samples and, more recently, of single cells has become the method of choice for measuring gene expression. The data from these assays are often summarized as counts of the number of reads mapping to each gene. One common analysis step is to cluster the samples, usually with a hierarchical clustering method. Here we explore an alternative: a model-based clustering method, "Latent Dirichlet Allocation", previously developed for natural language processing (Blei, Ng and Jordan 2003), that takes account of the count nature of the data. This model, like the admixture model in population genetics (Pritchard, Stephens and Donnelly 2000), allows each sample to belong to more than one cluster. We suggest different ways to visualize the results, and implement methods to identify genes whose expression characterizes each cluster. We illustrate the performance of the method by applying it to bulk RNA-seq data from the Genotype-Tissue Expression (GTEx) Project and to single-cell RNA-seq datasets. We also discuss the importance of dealing with batch effects in such data. Building on the maptpx package (Taddy 2014), our methods are implemented in an R package, CountClust.
|