Abstract:
|
For biomarkers developed for neurodegenerative disorders, the treatment groups or disease subgroups of interest may have heavily overlapping distributions on these biomarkers, which makes differentiating patients very challenging. Moreover, owing to the nature of these therapeutic areas, class imbalance and small sample sizes are also common, and they make differentiating patients with these biomarkers even harder. Through a simulation study, the classification performance of selected machine learning methods is evaluated on small-sample, overlapping data with class imbalance. The methods combine multiple resampling approaches and classification algorithms with two workflows that apply different cross-validation (CV) strategies. The simulation results suggest that random undersampling (RUS) is both necessary and the preferred resampling approach. Compared with accuracy, the geometric mean balances prediction accuracy across both classes and is a more robust performance metric. Considering both computation time and prediction performance, the one-step repeated CV strategy is preferred.
|
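To illustrate why the geometric mean can be more informative than accuracy under class imbalance, the following is a minimal sketch and not the study's implementation. It assumes the common definition of the geometric mean as the square root of sensitivity times specificity, which the abstract does not spell out. The function name `accuracy_and_gmean` and the toy labels are hypothetical.

```python
import math


def accuracy_and_gmean(y_true, y_pred):
    """Return (accuracy, geometric mean) for binary labels coded 0/1.

    Assumption: geometric mean = sqrt(sensitivity * specificity).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Per-class accuracies; fall back to 0.0 if a class is absent.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0

    accuracy = (tp + tn) / len(pairs)
    gmean = math.sqrt(sensitivity * specificity)
    return accuracy, gmean


# Hypothetical toy data: 90 majority-class (0) and 10 minority-class (1) cases.
y_true = [0] * 90 + [1] * 10

# A classifier that always predicts the majority class.
y_pred_majority = [0] * 100
acc_majority, gmean_majority = accuracy_and_gmean(y_true, y_pred_majority)

# A classifier that gets 70 of 90 majority and 8 of 10 minority cases right.
y_pred_balanced = [0] * 70 + [1] * 20 + [1] * 8 + [0] * 2
acc_balanced, gmean_balanced = accuracy_and_gmean(y_true, y_pred_balanced)

print(acc_majority, gmean_majority)
print(acc_balanced, gmean_balanced)
```

In this toy setting the always-majority classifier has the higher accuracy but a geometric mean of zero, because it never identifies a minority-class case, whereas the second classifier has lower accuracy and a much higher geometric mean.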