An unsupervised annotation of Arabic texts using multi-label topic modeling and genetic algorithm

Authors:

Highlights:

• The world produces huge amounts of unstructured text, which is of little use without labels.

• Humans cannot manually annotate textual data at this scale.

• In multi-label classification, each instance can be assigned more than one label.

• We propose combining multi-label topic modeling with a genetic algorithm to annotate texts.

• Our automatic annotation reaches 79.3% agreement with crowdsourced human annotators.
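
To make the multi-label idea in the highlights concrete, here is a minimal sketch of how a document's topic distribution could be turned into several labels at once. It is not the paper's method: the function name `assign_labels`, the fixed `threshold`, the fallback rule, and the topic names are all assumptions chosen for illustration, and the paper's genetic-algorithm step is not modeled here.

```python
def assign_labels(topic_probs, threshold=0.2):
    """Return every topic whose probability meets the threshold.

    Both the thresholding rule and its default value are illustrative
    assumptions, not something taken from the paper.
    """
    labels = [topic for topic, p in topic_probs.items() if p >= threshold]
    # Fall back to the single most probable topic so no document is left unlabeled.
    if not labels:
        labels = [max(topic_probs, key=topic_probs.get)]
    return sorted(labels)


# Hypothetical topic distribution for one document (e.g. as produced by LDA).
doc_topics = {"sports": 0.55, "politics": 0.30, "economy": 0.10, "health": 0.05}
labels = assign_labels(doc_topics)
print(labels)
```

With these made-up numbers the document receives two labels rather than one, which is the distinction between multi-label and single-label annotation.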

Keywords: Arabic corpus, Topic modeling, Multi-label annotation, Genetic algorithm, Latent Dirichlet allocation, Crowdsourcing

Review history: Received 12 September 2021, Revised 6 January 2022, Accepted 25 April 2022, Available online 6 May 2022, Version of Record 13 May 2022.

Official paper URL: https://doi.org/10.1016/j.eswa.2022.117384