Abstract:
|
Integrative analysis of large, heterogeneous datasets poses a central challenge in many areas of science. Tools exist to detect important main effects and low-order interactions between pairs or small subsets of parameters; however, the detection of nonlinear, high-order interactions at real-world sample sizes has remained fundamentally unsolved. Through extensive and realistic simulation, we developed a method based on Random Forests (RF) for detecting high-order interactions in low-sample regimes, with an order-zero increase in computational cost over the base algorithm. We regularize RF using soft dimension reduction and adaptive iterative refitting, and then decode the fitted data representation by analyzing feature usage in decision paths. We call our approach ``iterative Random Forests'', or iRF, and the general class of algorithms ``Introspective Learning'', to connote the importance of self-interrogation followed by iteration. We demonstrate the usefulness of iRF in two motivating studies: modeling enhancer sequences in Drosophila melanogaster, and identifying chromatin-RNA interactions at alternatively spliced exons in human cells. In both settings, iRF has similar or better predictive power compared to existing approaches, and provides new insights into relationships among the features. Although current challenges in the biosciences motivated the development of iRF, the algorithm is applicable to any prediction problem in which features are well defined.
|