Spam Filtering with KNN: Investigation of the Effect of k Value on Classification Performance


Şahin D. Ö., Demirci S.

28th Signal Processing and Communications Applications Conference (SIU), ELECTR NETWORK, 5-7 October 2020

  • Publication Type: Conference Paper / Full-Text Paper
  • Country of Publication: ELECTR NETWORK
  • Keywords: e-mail classification, spam filtering, nearest neighborhood, term frequency, inverse document frequency, chi-square feature selection
  • Affiliated with Ondokuz Mayıs University: Yes

Abstract

This study aims to filter spam e-mails using machine learning and text mining techniques. The K-Nearest Neighbor (KNN) algorithm, a machine learning technique, is used. KNN is an easy-to-use, high-performance classification algorithm, but its main difficulty is choosing the value of k in advance, because the performance of the algorithm changes with the selected k. Three data sets are examined: Enron, Ling-Spam and SMS-Spam-Collection. First, basic text mining techniques and the term frequency-inverse document frequency (TF-IDF) term weighting method are applied to all data sets. Then the best 500 attributes are selected with the Chi-Square feature selection method and given to the KNN algorithm. Finally, extensive experiments are carried out with k set to 1, 3, 5, 7 and 9. On all three data sets, the most successful result is obtained when k is 1. The best F-measure values obtained on the Ling-Spam, Enron and SMS-Spam-Collection data sets are 0.9324, 0.9215 and 0.9196, respectively.
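
The pipeline described in the abstract (TF-IDF weighting, Chi-Square selection of the top 500 terms, then KNN for k = 1, 3, 5, 7, 9, scored by F-measure) can be sketched roughly as follows. This is a minimal standard-library illustration, not the authors' implementation. The toy corpus, the function names, the whitespace tokenizer, the `log(N/df)` IDF variant, the document-presence form of the chi-square statistic and the use of cosine similarity as the KNN proximity measure are all assumptions, since the abstract does not specify them.

```python
import math
from collections import Counter

# Hypothetical toy corpus (1 = spam, 0 = ham); NOT taken from the paper's data sets.
TRAIN = [
    ("win free prize now", 1), ("free cash offer click now", 1),
    ("claim your free prize", 1), ("cheap loans click here", 1),
    ("win cash now", 1), ("meeting at noon tomorrow", 0),
    ("see you at lunch", 0), ("project report attached", 0),
    ("are you coming home", 0), ("call me after the meeting", 0),
]
TEST = [("free prize click now", 1), ("lunch meeting tomorrow", 0)]

def chi_square_scores(doc_sets, labels):
    """Chi-square of each term vs. the class, from a 2x2 presence table (assumed form)."""
    n, n_spam = len(labels), sum(labels)
    scores = {}
    for term in set().union(*doc_sets):
        a = sum(1 for d, y in zip(doc_sets, labels) if term in d and y == 1)
        b = sum(1 for d, y in zip(doc_sets, labels) if term in d and y == 0)
        c, d0 = n_spam - a, (n - n_spam) - b  # spam / ham documents without the term
        denom = (a + b) * (c + d0) * (a + c) * (b + d0)
        scores[term] = n * (a * d0 - b * c) ** 2 / denom if denom else 0.0
    return scores

def tfidf(tokens, idf):
    """Sparse TF-IDF vector restricted to the selected terms (the keys of idf)."""
    counts = Counter(t for t in tokens if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def knn_predict(train_vecs, train_labels, x, k):
    """Majority vote among the k training vectors most similar to x."""
    ranked = sorted(range(len(train_vecs)),
                    key=lambda i: cosine(x, train_vecs[i]), reverse=True)
    return Counter(train_labels[i] for i in ranked[:k]).most_common(1)[0][0]

def f_measure(y_true, y_pred):
    """F-measure with spam (label 1) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def run_experiment(k_values=(1, 3, 5, 7, 9), top_n=500):
    train_tokens = [text.split() for text, _ in TRAIN]
    train_labels = [y for _, y in TRAIN]
    # Keep the top_n terms by chi-square score (the toy vocabulary is far smaller than 500).
    scores = chi_square_scores([set(t) for t in train_tokens], train_labels)
    features = sorted(scores, key=scores.get, reverse=True)[:top_n]
    n = len(TRAIN)
    idf = {t: math.log(n / sum(1 for d in train_tokens if t in d)) for t in features}
    train_vecs = [tfidf(t, idf) for t in train_tokens]
    test_vecs = [tfidf(text.split(), idf) for text, _ in TEST]
    y_true = [y for _, y in TEST]
    results = {}
    for k in k_values:
        preds = [knn_predict(train_vecs, train_labels, x, k) for x in test_vecs]
        results[k] = (preds, f_measure(y_true, preds))
    return results

if __name__ == "__main__":
    for k, (preds, f1) in run_experiment().items():
        print(f"k={k}: predictions={preds}, F-measure={f1:.4f}")
```

A corpus of ten messages is far too small to say anything about which k works best. The sketch only shows how the steps fit together and how k enters as the single tuned parameter.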