Abstract:
|
Single-cell RNA sequencing (scRNA-seq) is a recently developed technology that quantifies RNA transcripts at the level of individual cells, providing cellular resolution of gene expression variation. The scRNA-seq data are counts of RNA transcripts for all genes in a species' genome. We adapt Latent Dirichlet Allocation (LDA), a generative probabilistic model originating in natural language processing (NLP), to model scRNA-seq data by treating genes as words, cells as documents, and latent biological functions as topics. In LDA, each document is considered the result of words generated from a mixture of topics, each with a different word usage frequency profile. We propose a penalized version of LDA to reflect the structure of scRNA-seq data, in which only a small subset of genes is expected to be topic-specific. We apply the penalized LDA to two scRNA-seq data sets to illustrate the usefulness of the model. Using inferred topic frequencies instead of word frequencies substantially improves the accuracy of cell type classification.
|