Multi-level out-of-vocabulary words handling approach

Abstract

Distributed representation models can generate vector representations only for words that belong to a finite vocabulary collected from the training data. If out-of-vocabulary (OOV) words are not handled properly, they can impair the performance of machine learning methods on natural language processing tasks. This study offers a new methodology based on the well-established top-down theory of human reading, which may serve as a strong basis for developing new techniques to deal with the OOV problem. To this end, we present MLOH, a Multi-Level OOV Handling approach based on three chained strategies: analogy, decoding, and prediction. Techniques available in the literature are generally limited because they often resolve only specific types of OOV words, such as those that can be inferred by analyzing their morphological structure or context. Compared with the process human readers use to infer unknown words, relying on a single strategy is generally not effective. We evaluated MLOH on tasks that can be highly affected by OOV words: part-of-speech tagging, named entity recognition, and text categorization of short and noisy texts. The results indicate that the proposed approach is promising: it handled most of the OOV words presented, is more general, and obtained competitive performance in all experiments.
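The idea of chaining strategies can be illustrated with a minimal Python sketch. Only the three strategy names and their order come from the abstract; everything else here is an assumption made for illustration. The toy vocabulary `VOCAB`, the helper names, and the concrete heuristics (spelling similarity for analogy, splitting into known parts for decoding, averaging context vectors for prediction) are hypothetical stand-ins, not the methods used in MLOH.

```python
from difflib import get_close_matches
from typing import Callable, Dict, List, Optional, Tuple

Vector = List[float]
Strategy = Callable[[str, List[str]], Optional[Vector]]

# Toy embedding table (hypothetical values, for illustration only).
VOCAB: Dict[str, Vector] = {
    "good": [1.0, 0.0],
    "night": [0.0, 1.0],
    "play": [0.5, 0.5],
}


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def by_analogy(word: str, context: List[str]) -> Optional[Vector]:
    # Stand-in heuristic: reuse the vector of a similarly spelled known word.
    match = get_close_matches(word, list(VOCAB), n=1, cutoff=0.75)
    return VOCAB[match[0]] if match else None


def by_decoding(word: str, context: List[str]) -> Optional[Vector]:
    # Stand-in heuristic: split the word into two known parts and average them.
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in VOCAB and right in VOCAB:
            return mean([VOCAB[left], VOCAB[right]])
    return None


def by_prediction(word: str, context: List[str]) -> Optional[Vector]:
    # Stand-in heuristic: average the vectors of known context words.
    known = [VOCAB[w] for w in context if w in VOCAB]
    return mean(known) if known else None


# Order follows the abstract: analogy, then decoding, then prediction.
CHAIN: List[Tuple[str, Strategy]] = [
    ("analogy", by_analogy),
    ("decoding", by_decoding),
    ("prediction", by_prediction),
]


def handle_oov(word: str, context: List[str]) -> Tuple[str, Optional[Vector]]:
    """Return (name of the level that resolved the word, vector or None)."""
    if word in VOCAB:
        return "in-vocabulary", VOCAB[word]
    for name, strategy in CHAIN:
        vector = strategy(word, context)
        if vector is not None:
            return name, vector
    return "unresolved", None
```

In this sketch each level runs only when the previous one returns nothing, so a word that no single heuristic covers can still be resolved by a later level. For example, `handle_oov("goodnight", [])` falls through the analogy level and is resolved by the decoding level.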

Keywords: Out-of-vocabulary words, Distributed vector representation, Natural language processing, Machine learning

Article history: Received 12 January 2022, Revised 8 April 2022, Accepted 23 April 2022, Available online 14 May 2022, Version of Record 1 July 2022.

Paper URL: https://doi.org/10.1016/j.knosys.2022.108911