Abstract:
|
Clustering is a form of unsupervised learning that aims to uncover latent groups within data based on similarity across a set of features. A common application in biomedical research is delineating novel cancer subtypes from patient gene expression data, given a set of informative genes. However, it is typically unknown a priori which genes are informative for discriminating between clusters, or what the optimal number of clusters is. Moreover, no existing method for unsupervised clustering of RNA-seq samples can adjust for between-sample global normalization factors and potential confounding variables while simultaneously selecting cluster-discriminatory genes and clustering subjects. To address this, we propose Feature Selection and Clustering of RNAseq (FSCseq): a model-based clustering algorithm that uses a finite mixture of regressions model and a quadratic penalty method with a smoothly clipped absolute deviation (SCAD) penalty. Maximization is performed via a penalized EM algorithm, which allows us to include normalization factors and confounders in the model. Given the fitted model, our framework allows subtype prediction in new patients via posterior probabilities of cluster membership. Our analyses demonstrate the utility of the method.
|