Comparison of Unbalanced Data Methods for Support Vector Machines

Akın P., Terzi Y.

Türkiye Klinikleri Biyoistatistik Dergisi, vol.13, no.2, pp.138-146, 2021 (Peer-Reviewed Journal)


Objective: A major problem when applying classification algorithms is that the classes are not equally represented in the data. In this study, eight different re-sampling methods were used to balance the dataset. Material and Methods: Support Vector Machines (SVM) were used to compare these methods. SVMs are supervised learning models, with associated learning algorithms, that analyze data for classification and regression analysis. The main task of the algorithm is to find the best line, or hyperplane, that divides the data into two classes. SVM is basically a linear classifier for linearly separable data but, in general, the feature vectors might not be linearly separable. To overcome this issue, the kernel trick was used. Results: This article presents a comparative study of different kernel functions (linear, radial, and sigmoid) for unbalanced data. The myocardial infarction dataset, which was taken from GitHub, was classified with 10-fold cross-validation to obtain more reliable performance estimates. Accuracy, sensitivity, specificity, precision, G-mean, and F score were used to compare the methods. The analysis was carried out in R. Conclusion: The performance metrics obtained on the original data improved with the random over-sampling examples (ROSE) re-sampling method for the linear and sigmoid kernel functions. The SMOTE method performed better with the radial kernel. In general, class imbalance causes classification algorithms to give biased results, and it should be eliminated.
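The six performance metrics named in the abstract can all be derived from a binary confusion matrix. The following is a minimal sketch in Python (the study itself used R), not the authors' code. The function names, the toy labels, and the choice of 1 as the positive class are illustrative assumptions. The abstract does not define G-mean or F score, so the sketch assumes the common definitions: G-mean as the square root of sensitivity times specificity, and F score as the harmonic mean of precision and sensitivity.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when a metric is undefined (0/0)."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(y_true, y_pred, positive=1):
    """Compute the six metrics listed in the abstract (assumed definitions)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    sensitivity = safe_div(tp, tp + fn)   # recall on the positive class
    specificity = safe_div(tn, tn + fp)   # recall on the negative class
    precision = safe_div(tp, tp + fp)
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        # Assumed definition: geometric mean of the two class-wise recalls.
        "g_mean": math.sqrt(sensitivity * specificity),
        # Assumed definition: F1, the harmonic mean of precision and recall.
        "f_score": safe_div(2 * precision * sensitivity,
                            precision + sensitivity),
    }


if __name__ == "__main__":
    # Hypothetical toy data: 2 positives, 8 negatives, and a classifier
    # that always predicts the majority (negative) class.
    y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    y_pred = [0] * 10
    for name, value in classification_metrics(y_true, y_pred).items():
        print(f"{name}: {value:.3f}")
```

On this toy example the always-negative classifier reaches an accuracy of 0.8 while its sensitivity, G-mean, and F score are all 0, which illustrates why accuracy alone is misleading on unbalanced data and why the study reports the additional metrics.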