JSM 2016

545 – Variable Selection and Risk Prediction in Genomics

Random Forest for Paired Data

Sponsor: Section on Statistics in Genomics and Genetics

Keywords: random forest, matched pairs, -omics sciences, metabolomics

Matthew W. Mitchell

Metabolon

Jacob E. Wulff

Metabolon

Philip R. Gunst

Metabolon

Random forest classification is a supervised method with many advantages over other multivariate methods: it is non-parametric, it is invariant to transformation, it does not overfit the data, it requires no variable selection, and it is fairly easy to implement in R. In particular, it works well with data from the -omics sciences, such as genomics and metabolomics, where the number of variables (p) is much greater than the number of subjects (n), i.e., where "p >> n." The out-of-bag (OOB) error is a good estimate of future performance. However, when the data consist of matched pairs, such as cancerous and benign tissue from the same subject or time-course data, the OOB error can be severely pessimistic, especially when the intra-subject correlation is very high. In some cases the OOB error is 100%, indicating perfect misclassification, when the true misclassification rate is much lower. Additionally, in the computation of variable importance, noise variables with high intra-subject correlation rank lower than those with low intra-subject correlation. We perform an extensive simulation study to compare cross-validation techniques for improving the estimate of the error, and we compare different sampling techniques for building the forest, with the aim of improving both the error estimate and the predictive ability. We also compare the methods on a human metabolomics study. Computing the residuals for each subject performed best but is difficult to apply in practice. Sampling by subject performed well but was comparable to the standard random forest. Leave-one-subject-out cross-validation corrects the bias of the OOB error.
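The abstract does not say how subject-level resampling was implemented, so the following is only a minimal Python sketch of the two ideas it names: leave-one-subject-out cross-validation and sampling by subject. The function names, the toy subject labels, and the choice to draw as many subjects as there are in the data are all assumptions made for illustration, not the authors' method. The point it shows is that both samples from a matched pair always stay on the same side of a split.

```python
import random
from collections import defaultdict


def group_indices(subject_ids):
    """Map each subject to the row indices of its samples."""
    by_subject = defaultdict(list)
    for idx, sid in enumerate(subject_ids):
        by_subject[sid].append(idx)
    return by_subject


def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject."""
    by_subject = group_indices(subject_ids)
    for sid, test_idx in by_subject.items():
        train_idx = [i for i, s in enumerate(subject_ids) if s != sid]
        yield sid, train_idx, test_idx


def subject_bootstrap(subject_ids, rng):
    """Draw subjects (not rows) with replacement.

    Returns (in_bag_indices, oob_indices). A subject is out-of-bag only if
    none of its samples were drawn, so paired samples are never split.
    """
    by_subject = group_indices(subject_ids)
    subjects = list(by_subject)
    drawn = [rng.choice(subjects) for _ in subjects]
    drawn_set = set(drawn)
    in_bag = [i for sid in drawn for i in by_subject[sid]]
    oob = [i for i, s in enumerate(subject_ids) if s not in drawn_set]
    return in_bag, oob


# Hypothetical toy data: 4 subjects, each contributing a matched pair of samples.
subject_ids = ["s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4"]

folds = list(leave_one_subject_out(subject_ids))
in_bag, oob = subject_bootstrap(subject_ids, random.Random(0))
```

In a full analysis, each tree (or each cross-validation fit) would be trained on the in-bag or training indices and scored on the held-out ones; the classifier itself is omitted here because the abstract gives no details about it.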
