Abstract:
|
Gene expression measurement has shifted from microarrays to next generation RNA-sequencing, producing ever richer high-throughput data for transcriptomics studies. limma/voom, edgeR, and DESeq2, state-of-the-art methods for analyzing such data, can all fail to control the type-I error. As such studies grow in size and frequency, there is an urgent need for better statistical methods. We model log-normalized RNA-seq counts as continuous variables using nonparametric regression to account for their inherent heteroscedasticity, in a principled, model-free, and efficient manner. We rely on a powerful variance component score test that can deal with both covariates adjustment in complex experimental designs (e.g. longitudinal studies) and data heteroscedasticity, to identify the genes whose expression is significantly associated with one or several factors of interest. Our test statistic has a simple form and limiting distribution, which can be computed quickly. A permutation version of the test is also derived for small sample sizes. Our test exhibits very good statistical properties in simulations, and in two applications (a candidate vaccine against EBOLA, and a tuberculosis study).
|