Identify gene-gene interactions using two-stage machine learning approach
*Hui-Yi Lin, Moffitt Cancer Center & Research Institute
Keywords: cancer, machine learning
A growing number of studies show that interactions among single nucleotide polymorphisms (SNPs) are important for understanding the causes of complex diseases. When the number of SNPs is large, searching all possible interactions with conventional statistical approaches (such as logistic regression) is impractical and statistically less powerful. We applied a two-stage approach (TRM), which combines two machine learning methods, Random Forests (RF) and Multivariate Adaptive Regression Splines (MARS), to detect interaction patterns. RF is applied first to select a predictive subset of SNPs, and MARS is then used to identify interaction patterns among the selected SNPs. Biomarkers that predict prostate cancer prognosis are urgently needed to guide the selection of a suitable treatment plan. We used the TRM approach to identify SNP interactions associated with prostate cancer aggressiveness. We examined 2,653 SNPs in 161 androgen receptor genes in 1,151 prostate cancer cases. A total of 40 SNPs were selected for further analysis in MARS. The bootstrap method was applied to validate variable significance. In the final model, two main effects (rs1477908 and rs3093040) and one 2-way interaction (rs1477908*rs1387665, p=0.0002) were identified. The results suggest that the TRM approach can successfully identify interaction patterns in studies with a large number of SNPs.
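
The two-stage idea can be illustrated with a minimal Python sketch. It is not the analysis reported in this abstract. It assumes scikit-learn and NumPy are available, runs on simulated genotypes rather than the study data, and replaces the MARS stage with an L1-penalized logistic regression over main effects and pairwise products, since MARS is not part of scikit-learn. The SNP indices, effect sizes, tree count and penalty strength are all hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Simulated data (hypothetical): 1,000 cases, 200 SNPs coded as allele counts 0/1/2.
n_cases, n_snps, n_keep = 1000, 200, 40
X = rng.integers(0, 3, size=(n_cases, n_snps))

# Assumed truth for the simulation: one main effect (SNP 3)
# and one 2-way interaction (SNP 3 x SNP 7).
logit = -2.0 + 0.8 * X[:, 3] + 0.8 * X[:, 3] * X[:, 7]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

# Stage 1: Random Forests ranks SNPs by importance; keep the top n_keep.
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X, y)
selected = np.argsort(rf.feature_importances_)[::-1][:n_keep]

# Stage 2 (stand-in for MARS, not MARS itself): expand the selected SNPs into
# main effects plus all pairwise products, then let an L1 penalty drop most terms.
expand = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
Z = expand.fit_transform(X[:, selected])
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
model.fit(Z, y)

# Map each expanded column back to the original SNP column indices it uses:
# a 1-tuple is a main effect, a 2-tuple is a 2-way interaction.
terms = [tuple(int(selected[k]) for k in np.flatnonzero(row)) for row in expand.powers_]

# Report the retained terms with the largest absolute coefficients.
kept = [(t, c) for t, c in zip(terms, model.coef_[0]) if c != 0.0]
for term, coef in sorted(kept, key=lambda tc: -abs(tc[1]))[:5]:
    print(term, round(float(coef), 3))
```

A validation step in the spirit of the bootstrap mentioned above would refit the second stage on resampled cases and record how often each term is retained.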