Abstract:
|
Identifying markers that are genetically related to a candidate pool from sequencing data plays a critical role in genetic and genomic studies aimed at treating cancer, controlling experiment quality, and learning evolutionary pathways. The complex dependence and leptokurtic nature of sequencing data, however, make conventional statistical approaches unreliable: they either lose control of the false discovery rate or sacrifice empirical power. Motivated by [Fan et al., 2017], we consider a factor-adjusted linear mixed model in which genetically related markers are identified by testing linear hypotheses such as contrasts. In particular, we propose a robust multiple testing procedure for such heavy-tailed, high-dimensional data. The proposed method simultaneously addresses the difficulties caused by inter-gene dependence and heavy-tailedness. We show, both theoretically and numerically, that our procedure improves power and controls the false discovery proportion (FDP) when the data are generated from heavy-tailed distributions. We apply our procedure to Guppy RNA sequencing data from Fischer et al. (2018) and reveal interesting biological insights.
|