Abstract:
|
Gene expression data has certain inherent characteristics that limit the applicability of classical clinical research methods: 1) Because the field is new, there is very little a priori knowledge of the variance required for traditional sample size calculations. 2) Genes are highly interdependent, working together to carry out physiological processes, which confounds regression estimates. 3) Similarly, conventional univariate power calculations can be highly inaccurate for multifactorial gene expression data. 4) Realistically, a typical study dataset stores expression levels for 8000 to 16000 genes for each of, say, 50 patients at most; thus, the number of explanatory variables (genes) grossly exceeds the sample size, violating a basic epidemiological design principle. 5) Among such a large number of comparisons, about 5% of the genes with no true difference can be expected to appear significant by chance (Type I errors) at the conventional 0.05 significance level. 6) Each sample costs up to $500, so for economic reasons we need the minimum sample size that gives maximum power to detect differences, even between weakly expressed genes.
To address these problems, we developed a method for estimating sample size based on dimension reduction and discriminant scores derived from the variance ratio. We validated the results.
|
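To give a sense of scale for point 5, the following minimal sketch computes the expected number of false positives for the dataset sizes mentioned in the abstract. It assumes a per-gene significance level of 0.05 and the worst case in which no gene is truly differentially expressed; the function name and the threshold are illustrative assumptions, not part of the method described here.

```python
# Illustrative sketch only: expected Type I errors when every gene is tested
# at level ALPHA and (worst case) no gene is truly differentially expressed.
ALPHA = 0.05  # assumed conventional threshold; not stated in the abstract


def expected_false_positives(n_genes: int, alpha: float = ALPHA) -> float:
    """Expected count of false positives among n_genes true null hypotheses."""
    # By linearity of expectation, each true null contributes alpha on average.
    return n_genes * alpha


if __name__ == "__main__":
    for n_genes in (8000, 16000):
        print(n_genes, "genes:", expected_false_positives(n_genes), "expected false positives")
```

Under these assumptions, a study of 8000 to 16000 genes would be expected to flag roughly 400 to 800 genes by chance alone, which is why uncorrected gene-by-gene testing is hard to interpret with only about 50 patients.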