Abstract:
|
This paper introduces a method to asymptotically control the false discovery rate (FDR) using data splitting. For each feature, the method estimates two independent significance coefficients via data splitting and combines them into a contrast statistic. FDR control is achieved by exploiting the property that, for any null feature, the sampling distribution of this statistic is symmetric about 0. We further propose a strategy that aggregates multiple data splits to stabilize the selection result and boost the power. Interestingly, this multiple data-splitting approach appears capable of overcoming the power loss caused by data splitting while still controlling the FDR. The proposed framework is applicable to canonical statistical models including linear models, generalized linear models, and Gaussian graphical models. Simulation results, as well as a real data application, show that the proposed approaches, especially the multiple data-splitting strategy, control the FDR well and are often more powerful than existing methods, including the Benjamini-Hochberg procedure and the knockoff filter.
|
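The single-split procedure the abstract describes (split, estimate twice, contrast, threshold) can be sketched for a linear model as follows. This is a minimal illustrative sketch, not the paper's implementation: the sample sizes, coefficient values, the specific contrast statistic `sign(b1*b2)*(|b1|+|b2|)`, and the target level `q` are all assumptions made here for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy setup (not values from the paper):
n, p, k = 600, 100, 10           # samples, features, true signals
beta = np.zeros(p)
beta[:k] = 0.8                   # first k features are non-null
X = rng.standard_normal((n, p))
y = X @ beta + rng.standard_normal(n)

# 1. Split the data into two independent halves.
idx = rng.permutation(n)
h1, h2 = idx[: n // 2], idx[n // 2:]

def ols(Xs, ys):
    """Ordinary least squares coefficient estimates via lstsq."""
    return np.linalg.lstsq(Xs, ys, rcond=None)[0]

b1, b2 = ols(X[h1], y[h1]), ols(X[h2], y[h2])

# 2. Contrast statistic built from the two independent estimates:
#    symmetric about 0 for null features, tends to be large and
#    positive for true signals.
M = np.sign(b1 * b2) * (np.abs(b1) + np.abs(b2))

# 3. Data-driven threshold: smallest t at which the estimated false
#    discovery proportion #{M < -t} / #{M > t} drops below q.
q = 0.1
tau = next(t for t in np.sort(np.abs(M))
           if (M < -t).sum() / max((M > t).sum(), 1) <= q)
selected = np.flatnonzero(M > tau)
```

The symmetry property is what makes step 3 work: because null features produce statistics symmetric about 0, the count of statistics below `-t` estimates the number of false positives among those above `t`, so the ratio estimates the false discovery proportion at threshold `t`.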