Automatic extraction of bilingual word pairs using inductive chain learning in various languages

Authors:

Abstract

In this paper, we propose a new learning method for extracting bilingual word pairs from parallel corpora in various languages. A cross-language information retrieval system must handle many languages, so automatic extraction of bilingual word pairs from parallel corpora in various languages is important. However, previous work based on statistical methods is insufficient because of the sparse data problem. Our learning method automatically acquires rules that are effective against the sparse data problem, using only parallel corpora and no previously prepared bilingual resource (e.g., a bilingual dictionary or a machine translation system). We call this learning method Inductive Chain Learning (ICL). Moreover, a system using ICL can extract bilingual word pairs even from bilingual sentence pairs in which the grammatical structure of the source language differs from that of the target language, because the acquired rules carry the information needed to cope with differing source and target word orders in local parts of bilingual sentence pairs. Evaluation experiments demonstrated that the use of ICL improved the recall of systems based on several statistical approaches.
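The sparse data problem mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not name a specific statistical measure, so the Dice coefficient is used here purely as an assumed example of a co-occurrence-based approach; the toy corpus and the helper names `dice_scores` and `best_candidates` are hypothetical, and the sketch does not implement ICL itself.

```python
from collections import Counter
from itertools import product


def dice_scores(sentence_pairs):
    """Score each (source word, target word) pair with the Dice coefficient.

    Dice(s, t) = 2 * cooc(s, t) / (freq(s) + freq(t)), where frequencies
    count the sentence pairs in which a word (or word pair) appears.
    """
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in sentence_pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        pair_count.update(product(src_words, tgt_words))
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in pair_count.items()
    }


def best_candidates(scores, src_word):
    """Return every target word tied for the top score, sorted alphabetically."""
    candidates = {t: v for (s, t), v in scores.items() if s == src_word}
    top = max(candidates.values())
    return sorted(t for t, v in candidates.items() if v == top)


# Hypothetical toy parallel corpus (English / French).
corpus = [
    ("the cat sleeps", "le chat dort"),
    ("the cat eats", "le chat mange"),
    ("the dog sleeps", "le chien dort"),
    ("a bird sings", "un oiseau chante"),
]

scores = dice_scores(corpus)

# "cat" occurs in two sentence pairs, so co-occurrence isolates "chat".
print(best_candidates(scores, "cat"))   # ['chat']

# "bird" occurs only once, so all three target words tie at 1.0.
print(best_candidates(scores, "bird"))  # ['chante', 'oiseau', 'un']
```

In this sketch, a word seen in only one sentence pair cannot be separated from its neighbours by co-occurrence counts alone, which is the kind of gap that rules acquired from the corpus are meant to help close.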

Keywords: Learning method, Bilingual word pairs, Various languages, Sparse data problem, Parallel corpora, Statistical approach

Review history: Received 31 July 2005, Revised 23 November 2005, Accepted 30 November 2005, Available online 23 January 2006.

Official paper URL: https://doi.org/10.1016/j.ipm.2005.11.004