Addressing Class Imbalance in Fetal Health Classification: Rigorous Benchmarking of Multi-Class Resampling Methods on Cardiotocography Data

Hawrami, Zainab; Cengiz, Mehmet; Dünder, Emre

doi:10.3390/diagnostics16030485

Hawrami Z. S. M., Cengiz M. A., Dünder E.

DIAGNOSTICS, vol.16, no.3, pp.1-28, 2026 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 16 Issue: 3
Publication Date: 2026
DOI Number: 10.3390/diagnostics16030485
Journal Name: DIAGNOSTICS
Indexes Covering the Journal: Scopus, Science Citation Index Expanded (SCI-EXPANDED), EMBASE, Directory of Open Access Journals
Pages: pp.1-28
Open Archive Collection: AVESİS Open Access Collection
Affiliated with Ondokuz Mayıs University: Yes

Abstract

Fetal health is essential in prenatal care, influencing both maternal and fetal outcomes. Cardiotocography (CTG) monitors uterine contractions and fetal heart rate, yet manual interpretation exhibits significant inter-examiner variability. Machine learning offers automated alternatives; however, class imbalance in CTG datasets, in which pathological cases constitute less than 10% of the data, leads to poor detection of minority classes. This study aims to provide the first systematic benchmark comparing five resampling strategies across seven classifier families for multi-class CTG classification, evaluated using imbalance-aware metrics rather than overall accuracy alone.

Methods: Seven machine learning models were employed: Naïve Bayes (NB), Random Forest (RF), Linear Discriminant Analysis (LDA), k-Nearest Neighbors (KNN), Linear Support Vector Machine (SVM), Multinomial Logistic Regression (MLR), and Multi-Layer Perceptron (MLP). To address class imbalance, we evaluated the original unbalanced dataset (base) and five resampling methods: SMOTE, BSMOTE, ADASYN, NearMiss, and SCUT. Performance was evaluated on a held-out test set using Balanced Accuracy (BACC), Macro-F1, the Macro-Matthews Correlation Coefficient (Macro-MCC), and Macro-Averaged ROC-AUC. We also report per-class ROC curves.

Results: Among all models, RF proved most reliable. Training on the original distribution (base) yielded the highest BACC (0.9118), whereas RF combined with BSMOTE provided the strongest class-balanced performance (Macro-MCC = 0.8533, Macro-F1 = 0.9073) with a near-perfect ROC-AUC (approximately 0.986–0.989). Overall, the effects of resampling were model-dependent. While some classifiers achieved optimal performance on the natural class distribution, oversampling techniques, particularly SMOTE and BSMOTE, significantly improved minority class discrimination and class-balanced metrics across multiple model families. Notably, certain models benefited substantially from resampling, showing higher Macro-F1, BACC, and minority class recall without sacrificing overall accuracy.

Conclusions: These findings establish robust, model-agnostic baselines for CTG-based fetal health screening. They highlight that strategic oversampling can translate improved minority class discrimination into clinically meaningful performance gains, supporting deployment in cost-sensitive and threshold-aware clinical settings.
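
The contrast the abstract draws between overall accuracy and imbalance-aware metrics can be illustrated with a minimal Python sketch. The three-class confusion matrix below is hypothetical (it is not taken from the study), and Macro-MCC is assumed here to mean the unweighted average of one-vs-rest MCC values; the paper may define it differently.

```python
from math import sqrt

# Hypothetical confusion matrix (rows = true class, columns = predicted class)
# for classes [normal, suspect, pathological]. The classifier predicts
# "normal" for every sample. These counts are illustrative only.
toy_cm = [
    [90, 0, 0],
    [7, 0, 0],
    [3, 0, 0],
]


def one_vs_rest_counts(cm, k):
    """Return (tp, fp, fn, tn) for class k treated as the positive class."""
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp
    fp = sum(row[k] for row in cm) - tp
    tn = total - tp - fn - fp
    return tp, fp, fn, tn


def accuracy(cm):
    """Fraction of all samples on the diagonal."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


def balanced_accuracy(cm):
    """Unweighted mean of per-class recall."""
    recalls = []
    for k in range(len(cm)):
        tp, _, fn, _ = one_vs_rest_counts(cm, k)
        recalls.append(tp / (tp + fn) if (tp + fn) else 0.0)
    return sum(recalls) / len(recalls)


def macro_f1(cm):
    """Unweighted mean of per-class F1; F1 is 0 when undefined."""
    scores = []
    for k in range(len(cm)):
        tp, fp, fn, _ = one_vs_rest_counts(cm, k)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def macro_mcc(cm):
    """Assumed definition: mean of one-vs-rest MCC; MCC is 0 when undefined."""
    scores = []
    for k in range(len(cm)):
        tp, fp, fn, tn = one_vs_rest_counts(cm, k)
        denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        scores.append((tp * tn - fp * fn) / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    print("accuracy         ", round(accuracy(toy_cm), 3))
    print("balanced accuracy", round(balanced_accuracy(toy_cm), 3))
    print("macro-F1         ", round(macro_f1(toy_cm), 3))
    print("macro-MCC        ", round(macro_mcc(toy_cm), 3))
```

On this toy matrix the classifier reaches 0.90 accuracy while never detecting a minority case, whereas balanced accuracy falls to about 0.33 and the assumed Macro-MCC to 0, which is the kind of gap that motivates reporting class-balanced metrics alongside accuracy.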