International Journal of Scientific Research in Computer Science, Engineering and Information Technology, vol. 11, no. 3, pp. 861-874, 2025 (Peer-Reviewed Journal)
Diabetes, a pervasive metabolic disorder, presents a significant global health challenge, with its escalating prevalence contributing to millions of deaths each year. This study aims to develop machine learning techniques that improve the accuracy of early diabetes diagnosis and to address class imbalance in medical datasets, which often skews model performance. The research question is how effectively machine learning techniques and class-imbalance correction methods improve diabetes prediction accuracy. Using the Pima Indians Diabetes Database, the study applies the Instance Hardness Threshold (IHT) undersampling technique and the Synthetic Minority Over-sampling Technique (SMOTE) to address class imbalance. Data preprocessing steps include imputing missing values, rebalancing the dataset, normalizing the data, and selecting relevant features. Recursive Feature Elimination identifies key features such as glucose, BMI, skin thickness, insulin, diabetes pedigree function, and age. Five classification algorithms are evaluated and compared: Decision Trees, K-Nearest Neighbour, Support Vector Machines, Random Forest, and Artificial Neural Networks. Random Forest emerges as the most effective, achieving an accuracy of 97.21%, precision of 95.5%, recall of 98.8%, F1-score of 97.1%, and AUC of 99.0% with the IHT undersampling technique. The other algorithms also show improved performance metrics after resampling. The study underscores the importance of addressing class imbalance in datasets used for diabetes prediction. The proposed framework enhances early diabetes diagnosis and contributes to improved healthcare efficiency. Future research should explore larger datasets and incorporate advanced deep-learning methods to further refine predictive performance.
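
The SMOTE rebalancing step named in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The function names `smote` and `f1_score`, the neighbour count `k`, the fixed seed, and the toy (glucose, BMI) values are all assumptions made for illustration and are not taken from the Pima Indians Diabetes Database. The sketch interpolates between a minority sample and one of its nearest minority neighbours, then checks that the reported F1-score is consistent with the reported precision and recall.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Simplified SMOTE: create n_new synthetic minority samples.

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        # Nearest minority neighbours of the base point, excluding itself.
        others = [p for i, p in enumerate(minority) if i != base_idx]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: (glucose, BMI) pairs, invented for this example.
majority = [(85.0, 22.1), (90.0, 24.3), (99.0, 26.0),
            (105.0, 27.5), (110.0, 25.2), (95.0, 23.8)]
minority = [(150.0, 33.0), (165.0, 35.5), (172.0, 31.2)]

# Oversample the minority class until both classes have the same size.
synthetic = smote(minority, n_new=len(majority) - len(minority), k=2)
balanced_minority = minority + synthetic
print(len(majority), len(balanced_minority))

# Consistency check on the metrics reported in the abstract.
print(round(f1_score(0.955, 0.988), 3))
```

In practice a library implementation would normally be preferred, for example the `SMOTE` and `InstanceHardnessThreshold` classes in imbalanced-learn. IHT undersampling is not shown here because it depends on a trained classifier's per-sample probability estimates.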