Activity Number: 306 - SPEED: SPAAC SESSION II
Type: Topic-Contributed
Date/Time: Wednesday, August 11, 2021 : 3:30 PM to 5:20 PM
Sponsor: Biometrics Section
Abstract #317854
Title: A Sparse Negative Binomial Mixture Model for Clustering RNA-Seq Count Data
Author(s): YUJIA LI* and Tanbin Rahman and Tianzhou Ma and Lu Tang and George Tseng
Companies: Department of Biostatistics, University of Pittsburgh and Department of Biostatistics, MD Anderson Cancer Center and Department of Epidemiology and Biostatistics, University of Maryland and University of Pittsburgh and Department of Biostatistics, University of Pittsburgh
Keywords: cluster analysis; Gaussian mixture model; sparse K-means; feature selection
Abstract:
Clustering with variable selection is a challenging yet critical task for modern small-n-large-p data. Existing methods based on sparse Gaussian mixture models or sparse K-means provide solutions for continuous data. Given the prevalence of RNA-seq technology and the lack of count data models for clustering, the current practice is to normalize count expression data into continuous measures and apply existing models with a Gaussian assumption. In this paper, we develop a negative binomial mixture model with lasso or fused lasso gene regularization to cluster samples (small n) with high-dimensional gene features (large p). A modified EM algorithm is used for inference, and the Bayesian information criterion is used to determine tuning parameters. The method is compared with existing methods using extensive simulations and two real transcriptomic applications in rat brain and breast cancer studies. The results show superior performance of the proposed count data model in clustering accuracy, feature selection, and biological interpretation in pathways.
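To make the mixture idea concrete, here is a minimal sketch of the E-step of an EM algorithm for a negative binomial mixture, written in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names `nb_logpmf` and `e_step` are hypothetical, the mean-dispersion parameterization (variance = mu + phi * mu^2) and the gene-specific dispersion shared across clusters are assumed, and the lasso or fused lasso penalty, the M-step, and BIC-based tuning described in the abstract are omitted.

```python
import math


def nb_logpmf(y, mu, phi):
    """Log-pmf of a negative binomial count y with mean mu and
    dispersion phi (assumed parameterization: var = mu + phi * mu**2)."""
    r = 1.0 / phi  # "size" parameter of the negative binomial
    return (
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(r / (r + mu))
        + y * math.log(mu / (r + mu))
    )


def e_step(counts, means, phi, weights):
    """Posterior cluster probabilities (responsibilities) per sample.

    counts  : list of samples, each a list of gene counts
    means   : means[k][g] is the mean of gene g in cluster k
    phi     : phi[g] is the dispersion of gene g (assumed shared by clusters)
    weights : mixing proportions, one per cluster
    """
    resp = []
    for y in counts:
        # Log joint of (sample, cluster k), treating genes as independent.
        logp = [
            math.log(w) + sum(nb_logpmf(yg, mk[g], phi[g]) for g, yg in enumerate(y))
            for w, mk in zip(weights, means)
        ]
        # Normalize with the log-sum-exp trick for numerical stability.
        m = max(logp)
        unnorm = [math.exp(lp - m) for lp in logp]
        total = sum(unnorm)
        resp.append([u / total for u in unnorm])
    return resp


if __name__ == "__main__":
    # Toy data: two samples, three genes, two clusters (all values made up).
    counts = [[2, 3, 1], [40, 55, 38]]
    means = [[2.0, 2.0, 2.0], [45.0, 45.0, 45.0]]
    phi = [0.1, 0.1, 0.1]
    weights = [0.5, 0.5]
    for row in e_step(counts, means, phi, weights):
        print([round(p, 4) for p in row])
```

In a full sparse method, an M-step would update the cluster means under a penalty that shrinks uninformative genes toward a common mean across clusters; that step is deliberately left out of this sketch.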
Authors who are presenting talks have a * after their name.