Language models and fusion for authorship attribution

Authors:

Abstract

We address authorship attribution, i.e., identifying the author of an unknown document, and propose part-of-speech (POS) tags as features for language modeling. The experiments are carried out on corpora that are atypical for the task, i.e., documents written by non-professional writers, such as movie reviews and tweets. The former corpus is homogeneous with respect to topic, which makes the task more challenging. The latter corpus places language models in a setting of continuously and rapidly evolving language, unique and noisy writing styles, and the limited length of social media messages. Although language models based on POS tags are competitive on only one of the corpora (movie reviews), they generally provide efficiency benefits and robustness against data sparsity. Furthermore, we experiment with model fusion, in which language models based on different modalities are combined. By linearly combining three language models, based on characters, words, and POS trigrams, respectively, we achieve the best generalization accuracy of 96% on movie reviews, while the combination of language models based on characters and POS trigrams yields 54% accuracy on the Twitter corpus. In fusion, POS language models prove to be essential and effective components.
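The linear fusion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class `NgramLM`, the functions `fuse` and `attribute`, the add-one smoothing, the length-normalized log-probability score, and the fusion weights are all assumptions chosen for illustration. Each author gets one n-gram language model per view (for example characters or POS tags), and the per-view scores are combined with a weighted sum.

```python
import math
from collections import Counter


def ngrams(tokens, n=3):
    """Return the n-grams of a token sequence, left-padded with start symbols."""
    padded = ["<s>"] * (n - 1) + list(tokens)
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


class NgramLM:
    """Add-one smoothed n-gram model (smoothing choice is an assumption)."""

    def __init__(self, sequences, n=3):
        self.n = n
        self.ngram_counts = Counter()
        self.context_counts = Counter()
        self.vocab = set()
        for seq in sequences:
            self.vocab.update(seq)
            for gram in ngrams(seq, n):
                self.ngram_counts[gram] += 1
                self.context_counts[gram[:-1]] += 1

    def avg_log_prob(self, seq):
        """Average log-probability per token, so documents of different
        lengths and views of different granularity are roughly comparable."""
        v = len(self.vocab) + 1  # one extra slot for unseen symbols
        grams = ngrams(seq, self.n)
        total = sum(
            math.log((self.ngram_counts[g] + 1) / (self.context_counts[g[:-1]] + v))
            for g in grams
        )
        return total / max(len(grams), 1)


def fuse(scores_by_view, weights):
    """Linearly combine per-author scores from several views.

    scores_by_view: {view_name: {author: score}}
    weights:        {view_name: weight} (hypothetical values, to be tuned)
    """
    authors = next(iter(scores_by_view.values())).keys()
    return {
        author: sum(weights[view] * scores_by_view[view][author]
                    for view in scores_by_view)
        for author in authors
    }


def attribute(scores_by_view, weights):
    """Return the author with the highest fused score."""
    fused = fuse(scores_by_view, weights)
    return max(fused, key=fused.get)
```

In use, one would train an `NgramLM` per author and per view (a string works directly as a character sequence, and a list of POS tags produced by any tagger works as a POS sequence), collect `avg_log_prob` values into `scores_by_view`, and call `attribute`. The weights would normally be tuned on held-out data; the abstract does not state how the original combination was weighted.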

Keywords: Authorship attribution, Language models, Computational linguistics, Text classification, Machine learning

Article history: Received 10 August 2018, Revised 10 May 2019, Accepted 14 June 2019, Available online 5 July 2019, Version of Record 5 July 2019.

Official paper URL: https://doi.org/10.1016/j.ipm.2019.102061