Abstract:
|
Understanding high-order interactions among features in supervised learning presents a substantial statistical challenge. Building on Random Forests (RFs), Random Intersection Trees (RIT), and extensive, biologically inspired simulations, we developed iterative Random Forests (iRF). iRF trains a feature-weighted ensemble of decision trees to detect stable, high-order interactions at a computational cost similar to that of RF. We demonstrate iRF for high-order interaction discovery in two prediction problems: enhancer activity in the early Drosophila embryo and alternative splicing of primary transcripts in human-derived cell lines. In Drosophila, among the 20 pairwise transcription factor interactions that iRF identifies as stable (returned in more than half of bootstrap replicates), 80% have been previously reported as physical interactions. Novel third-order interactions suggest high-order relationships that are candidates for follow-up experiments. In human-derived cells, iRF rediscovered a central role of H3K36me3 in chromatin-mediated splicing regulation and identified novel 5th- and 6th-order interactions, indicative of multivalent nucleosomes with specific roles in splicing regulation.
|