Abstract:
|
We consider estimating the relative frequencies in a multinomial distribution when the "distribution" of relative frequencies is strongly skewed, with many scarce classes and a few abundant ones, and the sample size is not large relative to the number of classes. This setting arises in serial analysis of gene expression (SAGE), where the multinomial variates are the counts of "distinct tags," ten-base-pair sequences corresponding to mRNA transcripts, in a biological sample. Here, maximum likelihood estimators (MLEs) are not optimal for scarce classes, and standard Bayesian estimators have very high mean squared error (MSE) for the abundant ones. We develop a new Bayesian estimation procedure using a Stratified Dirichlet prior, which partitions the classes into two strata, scarce and abundant, each with its own multivariate prior distribution. Our estimators automatically constrain the multinomial probabilities to sum to one and incorporate a form of nonlinear shrinkage, yielding estimates close to the MLEs for classes with large counts but shrunken estimates for classes with small counts. We demonstrate by simulation from a SAGE-like population that our method has smaller IMSE than either the MLE or the standard Bayesian estimator.
|
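The contrast between the MLE and a standard Bayesian estimator can be illustrated with a minimal sketch. This is not the Stratified Dirichlet procedure of the paper. It assumes that "standard Bayesian estimator" means the posterior mean under a symmetric Dirichlet prior, and the population (5 abundant and 995 scarce classes), the sample size, and all function names are hypothetical choices made only for illustration.

```python
import random


def mle(counts):
    """Maximum likelihood estimate: the observed relative frequencies."""
    total = sum(counts)
    return [c / total for c in counts]


def dirichlet_posterior_mean(counts, alpha=1.0):
    """Posterior mean under a symmetric Dirichlet(alpha) prior.

    Assumption: this stands in for the 'standard Bayesian estimator';
    it is not the stratified prior described in the abstract.
    """
    total = sum(counts)
    k = len(counts)
    return [(c + alpha) / (total + k * alpha) for c in counts]


def skewed_probs(n_abundant=5, n_scarce=995, abundant_mass=0.5):
    """Hypothetical skewed population: a few abundant classes share
    `abundant_mass`; the many scarce classes share the remainder equally."""
    probs = [abundant_mass / n_abundant] * n_abundant
    probs += [(1.0 - abundant_mass) / n_scarce] * n_scarce
    return probs


def sample_counts(probs, n, rng):
    """Draw n multinomial observations and tabulate counts per class."""
    counts = [0] * len(probs)
    for i in rng.choices(range(len(probs)), weights=probs, k=n):
        counts[i] += 1
    return counts


def squared_error(estimate, truth):
    """Sum of squared errors over a set of classes."""
    return sum((e - t) ** 2 for e, t in zip(estimate, truth))


rng = random.Random(0)  # fixed seed so the run is repeatable
probs = skewed_probs()
counts = sample_counts(probs, n=500, rng=rng)  # sample smaller than class count

p_mle = mle(counts)
p_bayes = dirichlet_posterior_mean(counts, alpha=1.0)

# Compare squared error separately on the abundant (first 5) and scarce classes.
print("abundant: MLE", squared_error(p_mle[:5], probs[:5]),
      "Bayes", squared_error(p_bayes[:5], probs[:5]))
print("scarce:   MLE", squared_error(p_mle[5:], probs[5:]),
      "Bayes", squared_error(p_bayes[5:], probs[5:]))
```

Both estimators sum to one by construction. With 1000 classes, 500 draws, and alpha = 1, the prior contributes more pseudo-counts than there are observations, so the symmetric prior shrinks every class by the same linear rule, including the abundant ones. The abstract's nonlinear shrinkage is aimed at exactly this behavior.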