Abstract Details

Activity Number: 545
Type: Contributed
Date/Time: Wednesday, August 3, 2016 : 10:30 AM to 12:20 PM
Sponsor: Section on Statistics in Genomics and Genetics
Abstract #320779
Title: Random Forest for Paired Data
Author(s): Matthew Mitchell* and Jacob Edward Wulff and Philip Ross Gunst
Companies: Metabolon and Metabolon and Metabolon
Keywords: random forest ; matched pairs ; -omics sciences ; metabolomics
Abstract:

Random forest classification is a supervised method that has many advantages over other multivariate methods: it is non-parametric, is invariant to transformation, does not overfit the data, requires no variable selection, and is fairly easy to implement in R. In particular, it works well with data from the -omics sciences, such as genomics and metabolomics, where the number of variables (p) is much greater than the number of subjects (n), i.e., where "p >> n." The out-of-bag (OOB) error is a good estimate of future performance. However, when the data consist of matched pairs, such as cancerous and benign tissue from the same subject, or of time-course data, the OOB error can be severely pessimistic, especially when the intra-subject correlation is very high. In some cases the OOB error is 100%, indicating that every sample is misclassified, when the true misclassification rate is much lower. Additionally, in the computation of variable importance, noise variables with high intra-subject correlation rank lower than those with low intra-subject correlation. We perform an extensive simulation study to compare cross-validation techniques for improving the error estimate, and we compare different sampling techniques for building the forest that aim to improve both the error estimate and the predictive ability. We also compare the methods on a human metabolomics study. Computing the residuals for each subject performed best but has problems in practical application. Sampling by subject performed well but was comparable to the standard random forest. Leave-one-subject-out cross-validation corrects the bias of the OOB error.
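
The abstract does not give implementation details, so the following is only a minimal sketch of how the three ideas it names might be set up for paired data. It uses plain Python with no modeling library, and the function names (leave_one_subject_out, bootstrap_by_subject, center_within_subject) are hypothetical. In particular, reading "residuals for each subject" as subtracting each subject's per-variable mean is an assumption, not necessarily the authors' definition.

```python
import random
from collections import defaultdict


def _rows_by_subject(subject_ids):
    """Map each subject ID to the list of row indices belonging to it."""
    groups = defaultdict(list)
    for i, s in enumerate(subject_ids):
        groups[s].append(i)
    return groups


def leave_one_subject_out(subject_ids):
    """Yield (subject, train_idx, test_idx), holding out ALL rows of one subject per fold."""
    groups = _rows_by_subject(subject_ids)
    for s, test_idx in groups.items():
        train_idx = [i for i, sid in enumerate(subject_ids) if sid != s]
        yield s, train_idx, test_idx


def bootstrap_by_subject(subject_ids, rng):
    """Resample whole subjects with replacement.

    Returns (in_bag_idx, oob_idx). A subject's rows are either all in the bag
    or all out of bag, so paired samples never straddle the two sets.
    """
    groups = _rows_by_subject(subject_ids)
    subjects = list(groups)
    drawn = [rng.choice(subjects) for _ in subjects]
    in_bag = [i for s in drawn for i in groups[s]]
    drawn_set = set(drawn)
    oob = [i for i, s in enumerate(subject_ids) if s not in drawn_set]
    return in_bag, oob


def center_within_subject(rows, subject_ids):
    """Subtract each subject's per-variable mean from that subject's rows.

    Assumption: this is one possible reading of "residuals for each subject".
    """
    groups = _rows_by_subject(subject_ids)
    centered = [None] * len(rows)
    for idx in groups.values():
        n_vars = len(rows[idx[0]])
        means = [sum(rows[i][j] for i in idx) / len(idx) for j in range(n_vars)]
        for i in idx:
            centered[i] = [rows[i][j] - means[j] for j in range(n_vars)]
    return centered
```

In use, a classifier would be fit on the rows in train_idx and scored on test_idx for each fold, and each tree of a forest would be grown on the in_bag rows from one subject-level bootstrap draw. Note that within-subject centering needs more than one sample from the same subject, which may be one source of the practical difficulty the abstract mentions for new, unpaired samples.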


Authors who are presenting talks have a * after their name.

Back to the full JSM 2016 program

Copyright © American Statistical Association