An abusive text detection system based on enhanced abusive and non-abusive word lists

Authors:

Highlights:

• We enhance abusive and non-abusive word lists based on learning algorithms and gadgets.

• We design an effective abusive text detection system using both word lists.

• We evaluate the system using real-world data and show its effectiveness.

Abstract

Abusive text (indiscriminate slang, abusive language, and profanity) on the Internet is not just a message but a tool for serious and brutal cyber violence. Devising a method to detect and prevent abusive text online has therefore become an important problem. However, the intentional obfuscation of words and phrases makes this task difficult. We design a decision system that detects abusive text, including obfuscated text, using unsupervised learning of abusive words based on word2vec's skip-gram model and cosine similarity. To detect the intentional obfuscation of abusive words, the system also deploys several efficient gadgets for filtering abusive text, such as blacklists, n-grams, edit-distance metrics, mixed languages, abbreviations, punctuation, and words with special characters. We integrate the unsupervised learning method and the gadgets into a single system that enhances abusive and non-abusive word lists. In malicious word detection, the integrated decision system based on the enhanced word lists achieves a precision of 94.08%, a recall of 80.79%, and an F-score of 86.93% for news article comments; a precision of 89.97%, a recall of 80.55%, and an F-score of 85.00% for online community comments; and a precision of 90.65%, a recall of 93.57%, and an F-score of 92.09% for Twitter tweets. We expect that our approach can help improve current abusive word detection systems, which are crucial for several web-based services, including social networking services and online games.
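
As a rough illustration of how two of the ingredients named in the abstract could fit together, the following minimal sketch expands a seed abusive word list by cosine similarity over word vectors and then flags obfuscated tokens by edit distance, with a non-abusive list acting as a whitelist. It is not the authors' implementation: the toy two-dimensional vectors stand in for word2vec skip-gram embeddings, and the function names, the 0.8 similarity threshold, the edit-distance limit of 1, and the rule that the whitelist takes precedence are all assumptions made for this example.

```python
import math
import re


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def expand_word_list(seeds, embeddings, threshold=0.8):
    """Add every word whose vector is close to some seed word (assumed rule)."""
    expanded = set(seeds)
    for word, vec in embeddings.items():
        for seed in seeds:
            if seed in embeddings and cosine_similarity(vec, embeddings[seed]) >= threshold:
                expanded.add(word)
                break
    return expanded


def normalize(token):
    """Lowercase and drop punctuation, digits, and special characters."""
    return re.sub(r"[^a-z]", "", token.lower())


def is_abusive(token, abusive, non_abusive, max_dist=1):
    """Flag a token near an abusive word unless it is whitelisted (assumed rule)."""
    word = normalize(token)
    if not word or word in non_abusive:
        return False
    return any(edit_distance(word, bad) <= max_dist for bad in abusive)


# Toy vectors standing in for skip-gram embeddings (hypothetical values).
toy_embeddings = {
    "idiot": [0.90, 0.10],
    "moron": [0.85, 0.20],
    "hello": [0.10, 0.90],
}
abusive_words = expand_word_list({"idiot"}, toy_embeddings)
non_abusive_words = {"moon"}

print(sorted(abusive_words))
print(is_abusive("id!ot", abusive_words, non_abusive_words))
print(is_abusive("moon", abusive_words, non_abusive_words))
```

In this sketch, "moon" lies within one edit of "moron" and would be flagged by edit distance alone, which is one way to see why a non-abusive word list is useful alongside the abusive one.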

Keywords: Abusive words, Slang words, Profanity, Cyber bullying, Detection systems

Review history: Received 15 December 2017, Revised 25 June 2018, Accepted 26 June 2018, Available online 30 June 2018, Version of Record 11 August 2018.

Official URL: https://doi.org/10.1016/j.dss.2018.06.009