Large-scale gene expression studies are becoming increasingly common as new microarray technology makes it possible to capture the expression profiles of thousands of genes at once. Statistical inference with such high-dimensional data structures (and, all too often, relatively small samples) is a challenging analytical problem.
First, we address multiple testing and provide closed-form optimal multiple testing procedures at given alternatives. Subsequently, we provide a formal statistical framework for subsetting and clustering involving: 1) defining the clustering/subset parameter of interest; 2) consistency of its estimate in the setting where the sample size divided by the logarithm of the number of genes converges to infinity; 3) a finite sample-size formula for uniform precision; 4) a bootstrap to establish reproducibility of clusters and corresponding visual cluster probability plots. We introduce a new partitioning and hierarchical clustering algorithm defining biological parameters of interest in the setting where clusters of interest are small relative to the number of genes.
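
As a rough illustration of how a bootstrap can quantify the reproducibility of a gene subset, the following minimal sketch resamples subjects with replacement and records how often each gene is re-selected. The subset rule used here (sample mean above a threshold) and the names `select_genes`, `bootstrap_inclusion`, `threshold`, and `n_boot` are assumptions made for illustration only; they stand in for whatever subset or clustering parameter is actually of interest.

```python
import random


def select_genes(data, threshold):
    """Hypothetical subset rule: genes whose sample mean exceeds threshold.

    data is a list of subjects, each a list of per-gene expression values.
    """
    n = len(data)
    p = len(data[0])
    return {j for j in range(p) if sum(row[j] for row in data) / n > threshold}


def bootstrap_inclusion(data, threshold, n_boot=200, seed=0):
    """Proportion of bootstrap resamples in which each gene is selected."""
    rng = random.Random(seed)
    n = len(data)
    p = len(data[0])
    counts = [0] * p
    for _ in range(n_boot):
        # Resample subjects (rows) with replacement, keeping genes fixed.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        for j in select_genes(resample, threshold):
            counts[j] += 1
    return [c / n_boot for c in counts]


# Toy data: 4 subjects, 3 genes (values are made up for the example).
toy = [
    [5.0, 0.0, 0.0],
    [5.0, 0.0, 3.0],
    [5.0, 0.0, 0.5],
    [5.0, 0.0, 2.0],
]
proportions = bootstrap_inclusion(toy, threshold=1.0)
```

Proportions near 1 or 0 suggest that a gene's membership in the subset is stable under resampling, while intermediate values flag genes whose membership depends on which subjects happened to be sampled.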