Abstract:
|
RNA-Seq data are increasingly used for whole-genome differential mRNA expression analysis in place of gene expression arrays such as those from Affymetrix and Illumina. Because the raw RNA-Seq data consist of counts of fragments mapping to each gene or exon, and because these counts are over-dispersed, it is common to model them as negative binomial. Yet empirically, methods based on the negative binomial often generate massively inflated numbers of false positives, whether they are applied to real data or to simulated negative binomial data. This appears to be a consequence of the fact that the negative binomial with unknown scale is not an exponential family distribution and that, when it is used as a quasi-likelihood, the link function, and thus the natural parameter, is a function of the scale parameter. Consequently, a linear model with a negative binomial quasi-likelihood is not a proper generalized linear model unless the scale is known. We demonstrate that, even when the data are truly negative binomial, it is better to use transformation or weighting followed by standard linear models than to fit a version of a generalized linear model with an estimated scale.
|
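The exponential-family argument in the abstract can be illustrated by writing out the negative binomial log-density in a standard mean/size parameterization. This is only a sketch: the symbols $\mu$ (mean), $k$ (size, the reciprocal of the dispersion), and $\theta$ (natural parameter) are notation assumed here for illustration and are not taken from the paper.

```latex
% Negative binomial with mean \mu and size k (assumed notation),
% so that Var(Y) = \mu + \mu^2 / k.
\[
  \log f(y;\mu,k)
    = y \,\log\!\frac{\mu}{\mu+k}
    + k \,\log\!\frac{k}{\mu+k}
    + \log\Gamma(y+k) - \log\Gamma(k) - \log y!
\]
% Reading off the coefficient of y gives the natural parameter:
\[
  \theta = \log\!\frac{\mu}{\mu+k},
  \qquad
  \mu = \frac{k\,e^{\theta}}{1-e^{\theta}} .
\]
```

For fixed $k$, this is a one-parameter exponential family in $\theta$, and the usual generalized linear model theory applies. When $k$ is unknown, the map from $\mu$ to $\theta$ (the canonical link) changes with every estimate of $k$, and the term $\log\Gamma(y+k)$ couples $y$ and $k$ in a way that does not separate into a product of a statistic of $y$ and a function of the parameters. Under these assumptions, that is the sense in which the link and natural parameter are functions of the scale.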