Abstract:
|
Identifying high-order epigenetic interactions among large biomolecules from next-generation sequencing (NGS) datasets is challenging because of the high dimensionality of the feature space and the heterogeneity of the human genome. Through extensive and realistic simulations, we have developed a method that stably detects biologically meaningful, local, high-order interactions in these datasets. Our method, iterative Random Forests (iRF), iteratively grows a sequence of feature-weighted Random Forests and searches for high-order interactions by analyzing feature usage on the decision paths of large, pure leaf nodes in the tree ensemble. In this work, we study the properties of iRF on a novel, biologically inspired class of locally sparse, nonlinear, and non-smooth models. We analyze both the prediction and feature selection properties of iRF and propose principled guidelines for assessing the estimation stability of the selected features and their interactions.
|