An effective feature selection method for web spam detection

Authors:

Highlights:

Abstract

Web spam is an illegal and unethical way to raise the ranking of web pages by deceiving search engine algorithms. Different methods have therefore been proposed to detect it and improve the quality of search results. Since a web page can be described by both its content and its links, the number of extracted features is high. Selecting features with high discriminative ability can thus serve as a preprocessing step that decreases computational time and cost. In this study, a new backward elimination approach is proposed for feature selection. The main idea, which is similar to sequential backward selection, is to measure the impact on classifier performance of eliminating a set of features rather than a single feature. The method seeks the largest feature subset whose omission from the full feature set not only does not reduce the performance of the classifier but also improves it. Experiments on the WEBSPAM-UK2007 dataset with a Naïve Bayes classifier show that the proposed method selects fewer features than other methods and improves the classifier's IBA index by about 7%.
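
The idea of eliminating groups of features can be illustrated with a minimal Python sketch. It is not the authors' exact procedure, which the abstract does not specify. The function names `iba` and `group_backward_elimination`, the halving schedule for the group size, and the default `alpha = 0.1` are all assumptions made for illustration. The `score` argument is a hypothetical callable; in the setting described above it would train a Naïve Bayes classifier on the given feature subset and return its IBA on validation data.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple


def iba(y_true: Sequence[int], y_pred: Sequence[int], alpha: float = 0.1) -> float:
    """Index of balanced accuracy: (1 + alpha * (TPR - TNR)) * TPR * TNR.

    Label 1 is the positive (spam) class, label 0 the negative class.
    alpha = 0.1 is an assumed default, not a value taken from the paper.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    tpr = tp / pos if pos else 0.0  # true positive rate (sensitivity)
    tnr = tn / neg if neg else 0.0  # true negative rate (specificity)
    return (1 + alpha * (tpr - tnr)) * tpr * tnr


def group_backward_elimination(
    features: Sequence[str],
    score: Callable[[FrozenSet[str]], float],
    group_size: int = 4,
) -> Tuple[List[str], float]:
    """Greedily drop consecutive groups of features while the score does not drop.

    Starts with groups of `group_size` features and halves the group size
    after each full pass, ending with single-feature elimination.
    """
    selected = list(features)
    best = score(frozenset(selected))
    size = group_size
    while size >= 1:
        i = 0
        while i < len(selected):
            # Candidate subset with the group selected[i:i + size] removed.
            candidate = selected[:i] + selected[i + size:]
            if candidate:
                s = score(frozenset(candidate))
                if s >= best:
                    # Removal did not hurt: accept it and retry at the same index.
                    selected, best = candidate, s
                    continue
            i += size
        size //= 2
    return selected, best
```

Accepting a removal whenever the score does not decrease mirrors the stated goal of finding a large removable subset. A stricter variant could require a strict improvement instead.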

Keywords: Web spam, Feature selection, Content-based features, Link-based features, Unbalanced data, Index of balanced accuracy (IBA)

Review history: Received 18 May 2018, Revised 20 December 2018, Accepted 21 December 2018, Available online 31 December 2018, Version of Record 23 January 2019.

Official paper URL: https://doi.org/10.1016/j.knosys.2018.12.026