CS-BPSO: Hybrid feature selection based on chi-square and binary PSO algorithm for Arabic email authorship analysis

Authors:

Abstract

Email authorship analysis is a challenging task that involves detecting an author’s writing style to help determine their identity. Emails represent a widespread application of big data, and email authorship analysis is widely performed in the field of forensic linguistics. However, the high-dimensional feature space encountered in authorship analysis degrades classification performance. Moreover, Arabic is a highly inflected language with unique characteristics that pose critical challenges in identifying context. The selection of prominent features is therefore a critical step in authorship analysis. Swarm intelligence (SI) algorithms are widely adopted to address such feature selection problems. In this study, an efficient hybrid feature selection algorithm that combines chi-square with binary particle swarm optimization (BPSO), termed CS-BPSO, was developed to enhance the performance of Arabic email authorship analysis. Both static and dynamic features were considered. Experiments were conducted on Arabic email messages collected from a sample population to test the algorithm's performance with three popular classifiers: support vector machine (SVM), K-nearest neighbour (KNN), and naïve Bayes (NB). Accuracy, precision, recall, and F1-score were used as performance measures. The results showed that the CS-BPSO method performs best with dynamic features. The findings were satisfactory across several types of difficulty, namely an imbalanced dataset, a small dataset, and short texts.
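The abstract does not spell out how the chi-square and BPSO stages are combined, so the following is only a minimal sketch of one common filter-then-wrapper arrangement, not the authors' implementation. It assumes non-negative count features, a chi-square filter that keeps the `top_k` highest-scoring features, and a sigmoid-transfer binary PSO whose fitness is leave-one-out 1-NN accuracy. The function names, the toy data, the fitness choice, and the parameter values (`w`, `c1`, `c2`, swarm size) are all hypothetical.

```python
import math
import random

def chi2_scores(X, y):
    """Chi-square score of each non-negative feature column against the labels."""
    n, n_features = len(X), len(X[0])
    classes = sorted(set(y))
    scores = []
    for j in range(n_features):
        total = sum(row[j] for row in X)
        score = 0.0
        for c in classes:
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            expected = total * sum(1 for label in y if label == c) / n
            if expected > 0:
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores

def loo_1nn_accuracy(X, y, cols):
    """Leave-one-out 1-NN accuracy using only the feature columns in cols."""
    if not cols:
        return 0.0  # an empty feature subset cannot classify anything
    hits = 0
    for i, xi in enumerate(X):
        nearest = min((k for k in range(len(X)) if k != i),
                      key=lambda k: sum((xi[j] - X[k][j]) ** 2 for j in cols))
        hits += y[nearest] == y[i]
    return hits / len(X)

def cs_bpso(X, y, top_k, n_particles=8, n_iter=15, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = random.Random(seed)
    # Stage 1 (filter, assumed): keep the top_k features by chi-square score.
    scores = chi2_scores(X, y)
    kept = sorted(range(len(scores)), key=lambda j: -scores[j])[:top_k]

    def fitness(bits):
        return loo_1nn_accuracy(X, y, [kept[i] for i, b in enumerate(bits) if b])

    # Stage 2 (wrapper): binary PSO over bit masks of the kept features.
    pos = [[rng.randint(0, 1) for _ in kept] for _ in range(n_particles)]
    vel = [[0.0] * len(kept) for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(len(kept)):
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                # Sigmoid transfer: the velocity sets the probability of bit = 1.
                prob = 1.0 / (1.0 + math.exp(-vel[i][d]))
                pos[i][d] = 1 if rng.random() < prob else 0
            f = fitness(pos[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return sorted(kept[i] for i, b in enumerate(gbest) if b), gbest_fit

# Toy data (made up): features 0 and 1 separate the two classes, 2 and 3 are constant.
X = [[5, 0, 1, 2], [4, 1, 1, 2], [6, 0, 1, 2],
     [0, 5, 1, 2], [1, 4, 1, 2], [0, 6, 1, 2]]
y = [0, 0, 0, 1, 1, 1]
selected, acc = cs_bpso(X, y, top_k=2)
print(selected, acc)
```

In this arrangement the chi-square filter shrinks the search space cheaply before the more expensive swarm search runs. Any classifier, such as the SVM, KNN, or NB models named in the abstract, could replace the 1-NN fitness function used here.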

Keywords: Swarm intelligence, Particle swarm optimization (PSO), Hybrid feature selection, Short texts, Chi-square, Arabic email text, Forensic analysis

Review history: Received 4 April 2021, Revised 27 May 2021, Accepted 9 June 2021, Available online 12 June 2021, Version of Record 16 June 2021.

Official paper URL: https://doi.org/10.1016/j.knosys.2021.107224