Abstract:
|
Random forest classification is a supervised method with many advantages over other multivariate methods: it is non-parametric, it is invariant to monotone transformations of the predictors, it does not overfit the data, it requires no variable selection, and it is fairly easy to implement in R. In particular, it works well with data from the -omics sciences, such as genomics and metabolomics, where the number of variables (p) is much greater than the number of subjects (n), i.e., where "p >> n." The out-of-bag (OOB) error is ordinarily a good estimate of future performance. However, when the data consist of matched pairs, such as cancerous and benign tissue from the same subject, or of time course measurements, the OOB error can be severely pessimistic, especially when the intra-subject correlation is very high. In some cases the OOB error is 100%, indicating complete misclassification, when the true misclassification rate is much lower. Additionally, in the variable importance computations, noise variables with high intra-subject correlation rank lower than those with low intra-subject correlation. We perform an extensive simulation study to compare cross-validation techniques for improving the error estimate, and to compare different sampling schemes for building the forest that improve both the error estimate and the predictive ability. We also compare the methods on a human metabolomics study. Computing the residuals for each subject performed best, but is difficult to apply in practice. Sampling by subject performed well, but was comparable to the standard random forest. Leave-one-subject-out cross-validation corrects the bias of the out-of-bag error.
|