A hexanucleotide repeat expansion in chromosome 9 open reading frame 72 (C9orf72) is one of the most common genetic causes of amyotrophic lateral sclerosis (ALS). Biomarkers based on whole-blood RNAseq data that distinguish C9orf72 ALS subjects may help in selecting the right patients for ALS-targeted therapy. We aimed to find an appropriate statistical classifier under two major challenges: the high dimensionality of RNAseq data and a class-imbalanced outcome due to the rarity of the C9orf72 mutation in ALS.
We developed a simulation framework to examine the performance of three classifiers (penalized support vector machine, lasso logistic regression, and random forest) under different class-balancing strategies. RNAseq libraries were simulated with various imbalance ratios, based on RNAseq data of brain tissue from the public domain and RNAseq data of whole blood from an internal clinical trial.
Simulation studies showed that balancing strategies improved the performance of the three classifiers differently, depending on the imbalance ratio and the separation of classes. In particular, the penalized support vector machine with an undersampling strategy performed best for our problem.
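
As an illustration of the undersampling strategy named above, the following is a minimal sketch of random undersampling, assuming the simplest variant in which every class is reduced to the size of the rarest class. The function name `undersample`, its parameters, and the toy data are hypothetical and are not taken from the study, which may have used a different implementation.

```python
import random
from collections import defaultdict


def undersample(samples, labels, seed=0):
    """Randomly drop samples from larger classes so that every class
    ends up with as many samples as the smallest class.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Size of the rarest class sets the target size for all classes.
    n_min = min(len(idxs) for idxs in by_class.values())

    # Draw n_min indices without replacement from each class.
    kept = []
    for idxs in by_class.values():
        kept.extend(rng.sample(idxs, n_min))
    kept.sort()  # preserve the original sample order

    return [samples[i] for i in kept], [labels[i] for i in kept]


# Toy data (made up): 5 minority-class and 45 majority-class samples,
# each with two placeholder features.
labels = [1] * 5 + [0] * 45
samples = [[float(i), float(i % 7)] for i in range(50)]

balanced_samples, balanced_labels = undersample(samples, labels)
```

The balanced subset would then be passed to a classifier such as a penalized support vector machine. Undersampling discards majority-class data, so in practice it is often repeated over several random draws.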