Machine Learning Based Spam E-mail Detection System for Turkish


Eryilmaz E. E., Şahin D. Ö., Kılıç E.

5th International Conference on Computer Science and Engineering (UBMK), Diyarbakır, Turkey, 9 - 11 September 2020, pp.7-12

  • Publication Type: Conference Paper / Full Text
  • DOI Number: 10.1109/ubmk50275.2020.9219487
  • City: Diyarbakır
  • Country: Turkey
  • Page Numbers: pp.7-12
  • Keywords: e-mail classification, spam filtering, machine learning, Turkish e-mail classification, Turkish spam filtering
  • Ondokuz Mayıs University Affiliated: Yes

Abstract

Electronic mail (e-mail) is a digital letter sent over the Internet. Files of all kinds, such as documents, pictures, music, and videos, can be attached to e-mails and transferred to the recipient's computer. Because it is cheap and easy to use, e-mail is sent to billions of people every year. It saves time and money, and has therefore become the most widely used tool for personal and professional communication. The same ease of use and low cost make e-mail attractive to people and groups who want to spread propaganda, advertise, or carry out phishing. To achieve their goals, they send unnecessary and unsolicited messages to the e-mail accounts of people they do not know. These messages cause serious material and moral damage to Internet users and also burden Internet traffic. Spam is e-mail that is sent to recipients without their consent, generally for malicious or promotional purposes. Spammers aim to persuade computer users to purchase legal or prohibited products and services. Existing spam-blocking methods often lag behind the innovations that spammers constantly introduce, which has led to the emergence of machine learning-based spam detection methods. In this study, spam is detected by applying 7 different machine learning methods to a dataset of 800 Turkish e-mails. When feature selection is performed with the chi-square test, the best result is obtained with the Sequential Minimal Optimization (SMO) algorithm. When feature selection is performed with the information gain method, the best result is obtained with the Multi Layer Perceptron (MLP) algorithm. The F-measure values obtained with the SMO and MLP algorithms are 0.985 and 0.984, respectively.
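
The two feature selection criteria and the evaluation metric named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the four toy messages, and their Turkish tokens are all hypothetical, the features are assumed to be binary term presence, and the classifiers themselves (SMO, MLP) are left out. Each term is scored against the spam label from a 2x2 contingency table, and the F-measure is computed from true positives, false positives, and false negatives.

```python
import math


def contingency(docs, labels, term):
    """Count (n11, n10, n01, n00) for one term.

    n11: spam containing the term, n10: non-spam containing it,
    n01: spam without it,          n00: non-spam without it.
    """
    n11 = n10 = n01 = n00 = 0
    for tokens, is_spam in zip(docs, labels):
        if term in tokens:
            if is_spam:
                n11 += 1
            else:
                n10 += 1
        elif is_spam:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic of a 2x2 term/class table."""
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0  # term or class is constant, so it carries no signal
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def entropy(counts):
    """Shannon entropy in bits of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(n11, n10, n01, n00):
    """Reduction in class entropy after splitting on term presence."""
    n = n11 + n10 + n01 + n00
    h_class = entropy([n11 + n01, n10 + n00])
    h_with = entropy([n11, n10])
    h_without = entropy([n01, n00])
    return h_class - (n11 + n10) / n * h_with - (n01 + n00) / n * h_without


def rank_terms(docs, labels, score):
    """Sort the vocabulary by descending score (ties broken alphabetically)."""
    vocabulary = set().union(*docs)
    return sorted(
        vocabulary,
        key=lambda t: (-score(*contingency(docs, labels, t)), t),
    )


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy corpus: each message is a set of tokens; True marks spam.
DOCS = [
    {"bedava", "kazan"},
    {"toplantı", "yarın"},
    {"bedava", "indirim"},
    {"rapor", "ekte"},
]
LABELS = [True, False, True, False]

if __name__ == "__main__":
    print(rank_terms(DOCS, LABELS, chi_square)[:3])
    print(rank_terms(DOCS, LABELS, information_gain)[:3])
```

In a pipeline of this kind, only the top-ranked terms would typically be kept as input features for the classifiers. The number of terms retained in the study is not stated in the abstract.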