Combining Embeddings of Input Data for Text Classification
Authors: Zuzanna Parcheta, Germán Sanchis-Trilles, Francisco Casacuberta, Robin Rendahl
Abstract
The problem of automatic text classification is an essential part of text analysis. Text classification can be improved at different levels, such as the preprocessing step or the network implementation. In this paper, we focus on how combining different methods of text encoding affects classification accuracy. To do this, we implemented a multi-input neural network that encodes the input text with several techniques: BERT, a neural embedding layer, GloVe, skip-thoughts and ParagraphVector. The input text can be tokenised and represented at different levels: sentence, word, byte pair encoding and character. Experiments were conducted on seven datasets in languages from different families: English, German, Swedish and Czech. Some of these languages feature agglutination and grammatical cases. Two of the seven datasets originate from real commercial scenarios: (1) classifying ingredients into their corresponding classes, using a corpus provided by Northfork; and (2) classifying texts according to the English level of their writers, using a corpus provided by ProvenWord. The developed architecture achieves improvements with different combinations of text encoding techniques, depending on the characteristics of each dataset. Once the best combination of embeddings at different levels was determined, different multi-input neural network architectures were compared. The results obtained with the best embedding combination and the best architecture were compared with state-of-the-art approaches, and outperformed those baselines on the datasets used in the experiments.
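The paper's own code is not reproduced here; the following is a minimal sketch of the kind of multi-input architecture the abstract describes, assuming Keras. One branch learns a trainable word-level embedding layer, while a second branch consumes precomputed sentence vectors (e.g. from BERT or skip-thoughts). All layer sizes, input names and the pooling/merge choices are illustrative assumptions, not the authors' exact configuration.

```python
# Hypothetical sketch (not the authors' code): a two-branch multi-input
# classifier that combines a trainable word-level embedding with
# precomputed sentence embeddings. All sizes below are assumptions.
import numpy as np
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000   # word-level vocabulary size (assumption)
MAX_LEN = 100        # tokens per input text (assumption)
SENT_DIM = 768       # dimensionality of precomputed sentence vectors
NUM_CLASSES = 5      # number of target classes (dataset-dependent)

# Branch 1: word indices -> trainable embedding -> pooled representation.
word_ids = layers.Input(shape=(MAX_LEN,), name="word_ids")
x1 = layers.Embedding(VOCAB_SIZE, 128)(word_ids)
x1 = layers.GlobalAveragePooling1D()(x1)

# Branch 2: a fixed-size sentence embedding computed offline
# (e.g. a BERT or skip-thoughts vector).
sent_vec = layers.Input(shape=(SENT_DIM,), name="sentence_embedding")
x2 = layers.Dense(128, activation="relu")(sent_vec)

# Combine both encodings and classify.
merged = layers.concatenate([x1, x2])
out = layers.Dense(NUM_CLASSES, activation="softmax")(merged)

model = models.Model(inputs=[word_ids, sent_vec], outputs=out)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Dummy forward pass showing the expected input shapes.
batch = {
    "word_ids": np.random.randint(0, VOCAB_SIZE, size=(2, MAX_LEN)),
    "sentence_embedding": np.random.rand(2, SENT_DIM).astype("float32"),
}
print(model.predict(batch).shape)  # (2, NUM_CLASSES)
```

Further branches, such as character-level or byte-pair-encoding-level inputs, could be concatenated before the final softmax in the same way, which is the general pattern a multi-input embedding combination implies.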
Keywords: Text classification, Multi-input network, Agglutinative language, Inflected language, Embedding combination
Paper URL: https://doi.org/10.1007/s11063-020-10312-w