Effective language identification of forum texts based on statistical approaches

Authors:

Highlights:

• This investigation deals with the problem of language identification of noisy texts.

• Two statistical approaches are proposed: High Frequency Approach and Nearest Prototype Approach.

• The proposed methods are evaluated on forum datasets containing 32 different languages.

• An experimental comparison is made with LIGA, NTC, Google Translate and Microsoft Word.

• Results show that the proposed approaches are promising for language identification of forum texts.

Abstract

This investigation deals with the problem of language identification of noisy texts. Two statistical approaches are proposed: High Frequency Approach and Nearest Prototype Approach. The proposed methods are evaluated on forum datasets containing 32 different languages. An experimental comparison is made with LIGA, NTC, Google Translate and Microsoft Word. Results show that the proposed approaches are promising for language identification of forum texts.

Keywords: Natural language processing, Automatic language identification, Forum texts, Hybrid approaches, Statistical approaches, N-Grams

Review history: Received 18 March 2015, Revised 27 November 2015, Accepted 1 December 2015, Available online 31 December 2015, Version of Record 17 May 2016.

Official paper URL: https://doi.org/10.1016/j.ipm.2015.12.003