Abstract:
|
Classification of imbalanced datasets is one of the biggest challenges in data mining. Class imbalance severely compromises model learning, since classifiers tend to be biased towards the prevalent class. In addition, evaluating a model's accuracy is jeopardized by the dearth of minority-class data. Simulation studies are used to analyze three re-sampling algorithms (over-sampling, under-sampling, and SMOTE) and several evaluation metrics for assessing the effectiveness of a classifier on imbalanced data. The results suggest that, when the data are imbalanced, model evaluation metrics may reveal more about the class distribution than about the actual performance of the models. Moreover, some classification models were found to be highly sensitive to imbalance and to perform poorly in such cases. The final decision in model selection should therefore rest on a combination of different metrics rather than on a single one. To avoid or minimize imbalance-biased performance estimates, we recommend reporting both the obtained metric values and the degree of imbalance in the data.
|