Abstract:
RNA-Seq has emerged as a revolutionary technique for gene expression profiling. In this technology, the expression of each gene in a sample is usually summarized by the number of reads mapped to that gene in that sample. A common problem of statistical interest is to cluster the samples into biologically relevant subgroups based on these count data. Existing methods for clustering gene expression data have two drawbacks here. First, because they do not explicitly model the counts, the correlation matrix may be unduly influenced by the large number of zeros and low counts in the data. Second, technical effects or artifacts (such as batch effects) often have a strong effect on the data, leading to clusters quite different from the true biological clusters. In this work, we investigate a different clustering technique, based on the admixture model for Poisson data, that can control for batch effects to extract the true biological clusters. We perform an extensive simulation study to show that our estimated model indeed corresponds closely to the true model, while clustering algorithms that do not take the batch effect into account fail completely.
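The batch-effect problem described in the abstract can be illustrated with a small simulation. The sketch below is not the admixture model studied in this work. It assumes a simple log-linear Poisson model with hypothetical effect sizes, uses a two-group split on the leading principal component as a stand-in for a generic clustering method, and uses per-gene batch-mean centering as a stand-in for batch control. All function names and parameter values are illustrative assumptions.

```python
import numpy as np

def simulate_counts(n_per_group=10, n_genes=200, batch_sd=1.0,
                    cluster_sd=0.5, seed=0):
    """Simulate Poisson read counts (samples x genes) with additive
    cluster and batch effects on the log rate. Hypothetical settings."""
    rng = np.random.default_rng(seed)
    # Balanced design: two biological clusters crossed with two batches.
    cluster = np.repeat([0, 1, 0, 1], n_per_group)
    batch = np.repeat([0, 0, 1, 1], n_per_group)
    base = rng.normal(2.0, 0.5, size=n_genes)
    cluster_eff = rng.normal(0.0, cluster_sd, size=(2, n_genes))
    batch_eff = rng.normal(0.0, batch_sd, size=(2, n_genes))
    log_rate = base + cluster_eff[cluster] + batch_eff[batch]
    counts = rng.poisson(np.exp(log_rate))
    return counts, cluster, batch

def two_group_split(x):
    """Split samples in two by the sign of the leading principal component."""
    centered = x - x.mean(axis=0)
    u, _, _ = np.linalg.svd(centered, full_matrices=False)
    return (u[:, 0] > 0).astype(int)

def agreement(labels, truth):
    """Fraction of samples matching truth, up to swapping the two labels."""
    acc = np.mean(labels == truth)
    return max(acc, 1.0 - acc)

counts, cluster, batch = simulate_counts()
log_counts = np.log1p(counts)

# Naive grouping that ignores the batch labels.
naive = two_group_split(log_counts)

# Crude batch control: subtract each gene's within-batch mean.
adjusted = log_counts.copy()
for b in np.unique(batch):
    adjusted[batch == b] -= log_counts[batch == b].mean(axis=0)
corrected = two_group_split(adjusted)

print("naive vs batch:      ", agreement(naive, batch))
print("naive vs cluster:    ", agreement(naive, cluster))
print("corrected vs cluster:", agreement(corrected, cluster))
```

With the batch effect set larger than the cluster effect, as assumed here, the naive split should track batch rather than biology, while the batch-centered split should track the biological clusters. A model-based approach would instead estimate the batch effect jointly with cluster membership.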
Copyright © American Statistical Association.