Multilingual sentence categorization and novelty mining
作者:
Highlights:
•
摘要
A challenge for sentence categorization and novelty mining is to detect not only when text is relevant to the user’s information need, but also when it contains something new which the user has not seen before. It involves two tasks that need to be solved. The first is identifying relevant sentences (categorization) and the second is identifying new information from those relevant sentences (novelty mining). Many previous studies of relevant sentence retrieval and novelty mining have been conducted on the English language, but few papers have addressed the problem of multilingual sentence categorization and novelty mining. This is an important issue in global business environments, where mining knowledge from text in a single language is not sufficient. In this paper, we perform the first task by categorizing Malay and Chinese sentences, then comparing their performances with that of English. Thereafter, we conduct novelty mining to identify the sentences with new information. Experimental results on TREC 2004 Novelty Track data show similar categorization performance on Malay and English sentences, which greatly outperform Chinese. In the second task, it is observed that we can achieve similar novelty mining results for all three languages, which indicates that our algorithm is suitable for novelty mining of multilingual sentences. In addition, after benchmarking our results with novelty mining without categorization, it is learnt that categorization is necessary for the successful performance of novelty mining.
论文关键词:Multilingual categorization,Sentence retrieval,Novelty mining,Malay,Chinese
论文评审过程:Received 29 April 2009, Revised 26 October 2009, Accepted 8 February 2010, Available online 6 March 2010.
论文官网地址:https://doi.org/10.1016/j.ipm.2010.02.003