DatRel: a noise-tolerant data relocation approach for effective synthetic data generation in imbalanced classifiers



Sağlam F.

MACHINE LEARNING, no.5, 2025 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Publication Date: 2025
  • DOI Number: 10.1007/s10994-025-06755-8
  • Journal Name: MACHINE LEARNING
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, PASCAL, Aerospace Database, Applied Science & Technology Source, Communication Abstracts, Compendex, Computer & Applied Sciences, INSPEC, Linguistics & Language Behavior Abstracts, Metadex, zbMATH, Civil Engineering Abstracts
  • Open Archive Collection: AVESİS Open Access Collection
  • Affiliated with Ondokuz Mayıs University: Yes

Abstract

Most machine learning algorithms tend to be biased towards the majority class when a dataset exhibits a skewed distribution in the class variable. This is called the class imbalance problem and is frequently encountered in real-life applications. One of the most prevalent methods for addressing class imbalance is data resampling, which generates or removes samples to balance the dataset. A well-known issue with oversampling is noise generation. Noise removal or hybrid resampling is commonly used to deal with this noise. However, because these methods remove samples, they cause imbalance to re-emerge. In this study, a data relocation approach named DatRel is proposed to address the noise generation problem of oversampling without causing imbalance. The proposed approach utilizes pure and proper class cover catch digraphs (P-CCCD) to determine dominant points and cover areas for the minority class. Then, new samples produced by oversampling are drawn towards the dominant points until they are covered. This process ensures that newly generated samples never overlap with a negative sample. Imbalance is not affected, since no sample is removed by undersampling. The proposed DatRel approach is applied to commonly used oversampling methods, namely SMOTE, ADASYN, and BLSMOTE. Moreover, the performance of the DatRel approach is compared to noise filtering methods such as Tomek links, ENN, NEATER, and NearMiss applied after SMOTE. Several baseline classification algorithms are employed, and comparisons are made using various metrics. Results on 49 imbalanced datasets show that DatRel improves classifier performance for oversampling methods and demonstrates its value in comparison to other noise removal techniques according to AUC, BACC, F1, GMEAN, and MCC.
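
The relocation idea described in the abstract can be illustrated with a minimal sketch. This is not the author's implementation: it assumes Euclidean distance, a ball around each minority point whose radius is the distance to the nearest majority point, a greedy selection of dominant points, and a straight-line pull towards the nearest ball. The function names `pcccd_cover` and `relocate`, the `shrink` parameter, and the toy data are hypothetical.

```python
import numpy as np

def pcccd_cover(X_min, X_maj):
    """Greedy cover of the minority class by majority-free balls (simplified)."""
    # Radius of each candidate ball: distance to the nearest majority point,
    # so the open ball contains no majority sample.
    d_maj = np.linalg.norm(X_min[:, None, :] - X_maj[None, :, :], axis=2)
    radii = d_maj.min(axis=1)
    # covers[i, j] is True when the ball around minority point i contains point j.
    d_min = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    covers = d_min < radii[:, None]
    np.fill_diagonal(covers, True)  # every point covers itself (guards radius 0)
    uncovered = np.ones(len(X_min), dtype=bool)
    chosen = []
    while uncovered.any():
        # Pick the ball that covers the most still-uncovered minority points.
        best = int((covers & uncovered).sum(axis=1).argmax())
        chosen.append(best)
        uncovered &= ~covers[best]
    return X_min[chosen], radii[chosen]

def relocate(X_syn, centers, radii, shrink=0.99):
    """Pull uncovered synthetic samples towards the nearest ball until inside it."""
    X_syn = np.asarray(X_syn, dtype=float)
    X_new = X_syn.copy()
    d = np.linalg.norm(X_syn[:, None, :] - centers[None, :, :], axis=2)
    covered = (d < radii[None, :]).any(axis=1)
    # Assumption: the target ball is the one whose surface is closest.
    nearest = (d - radii[None, :]).argmin(axis=1)
    for i in np.where(~covered)[0]:
        k = nearest[i]
        dist = d[i, k]
        if dist > 0:
            # Move along the line to the center, stopping just inside the ball.
            X_new[i] = centers[k] + (X_syn[i] - centers[k]) * (shrink * radii[k] / dist)
    return X_new

# Toy data (hypothetical): three minority points, two majority points,
# and three synthetic samples as an oversampler might produce them.
X_min = np.array([[0.0, 0.0], [0.5, 0.0], [4.0, 0.0]])
X_maj = np.array([[2.0, 0.0], [2.0, 1.0]])
X_syn = np.array([[2.0, 0.0], [1.0, 0.0], [2.5, 0.0]])

centers, radii = pcccd_cover(X_min, X_maj)
X_rel = relocate(X_syn, centers, radii)
```

In this sketch, synthetic samples that already lie inside a ball are left unchanged, and no sample is deleted, so the class ratio reached by the oversampler is preserved. How the paper chooses the target dominant point and the step size may differ from the choices made here.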