Abstract:
|
In analyzing gene expression data from a range of experiments, we have used "traditional" statistical tests such as the t-test, correlation coefficient analysis, and one-way and two-way analysis of variance, as well as less standard statistics, as measures of significance. Using specific examples from Aventis stem cell and oncology research, we will present some of the modifications needed to achieve adequate power in gene expression analysis: in particular, modifications that allow for the large data sets analyzed (up to 75,000 tests conducted simultaneously, often with low degrees of freedom) and for correlations and/or lack of consistency between samples. The role of a priori noise models will be discussed, with an overall emphasis on maximizing sensitivity at a given false discovery rate. We will also cover supervised classification of either genes or biological samples, based on their expression profiles, using the k-nearest-neighbor method, with an emphasis on estimating error rates through internal and biological cross-validation and on using preliminary statistical filters to optimize final classifier performance.
|
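As an illustration of what controlling the false discovery rate across many simultaneous tests can look like, here is a minimal sketch of the Benjamini-Hochberg step-up rule. The abstract does not say which procedure the authors use, so this choice is an assumption, as are the function name `benjamini_hochberg` and the parameter `fdr`. The standard guarantee for this rule assumes independent or positively dependent tests, which may not hold when samples are correlated, as the abstract notes they can be.

```python
from typing import List


def benjamini_hochberg(p_values: List[float], fdr: float = 0.05) -> List[bool]:
    """Flag which tests are called significant at the target FDR.

    Hypothetical helper for illustration only; not the authors' method.
    """
    m = len(p_values)
    if m == 0:
        return []

    # Indices of the tests, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Step-up rule: find the largest 1-based rank k with p_(k) <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank

    # Every test ranked at or below the cutoff is rejected, even if its own
    # p-value missed its individual threshold.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # Made-up p-values, one per gene, purely to exercise the function.
    example = [0.02, 0.9, 0.035, 0.03]
    print(benjamini_hochberg(example, fdr=0.05))
```

With tens of thousands of genes the same logic applies unchanged; only the length of the p-value list grows. The sketch takes p-values as given and does not address how they are obtained at low degrees of freedom.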