Improving Information Retrieval Performance on OCRed Text in the Absence of Clean Text Ground Truth

Authors:

Highlights:

• The proposed algorithm uses context information to segregate semantically related error variants from the unrelated ones.

• String similarity measures are used to join error variants with the correct query word.

• The algorithm is tested on Bangla, Hindi and English datasets to show that the proposed approach is language-independent.

• The Bangla and Hindi datasets have clean, error-free versions for comparison, so we use retrieval performance on the clean text as the performance upper bound. In addition, we compare our method with an error modelling approach which, unlike our method, uses the clean version.

• The English dataset is a genuine use case for our algorithm, as it has no error-free version.

• Our proposed method produces significant improvements over most of the baselines.

• We have also tested our proposed algorithm on the TREC 5 Confusion Track dataset and shown that it is significantly better than the baselines.
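The first two highlights describe combining string similarity with context information to find OCR error variants of a query word. The following is only a minimal sketch of that general idea, not the authors' algorithm: the function names, the use of `difflib.SequenceMatcher` as the string similarity measure, the fixed context window, and the thresholds are all assumptions made for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def context_counts(docs, window=2):
    """Map each term to a Counter of terms seen within `window` positions of it."""
    contexts = {}
    for doc in docs:
        tokens = doc.lower().split()
        for i, tok in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            neighbours = tokens[lo:i] + tokens[i + 1:hi]
            contexts.setdefault(tok, Counter()).update(neighbours)
    return contexts


def expand_query_term(term, docs, sim_threshold=0.7, min_shared=1, window=2):
    """Return corpus terms that look like OCR variants of `term`.

    Hypothetical two-step filter (thresholds are illustrative assumptions):
      1. keep candidates whose spelling is close to `term`;
      2. of those, keep candidates sharing at least `min_shared` context
         words with `term`, to drop similar-looking but unrelated words.
    """
    contexts = context_counts(docs, window)
    term_ctx = set(contexts.get(term, ()))
    variants = []
    for cand, ctx in contexts.items():
        if cand == term:
            continue
        if SequenceMatcher(None, term, cand).ratio() < sim_threshold:
            continue  # spelling too different to be an error variant
        if len(term_ctx & set(ctx)) >= min_shared:
            variants.append(cand)
    return sorted(variants)


docs = [
    "the river floods the village every monsoon",
    "the rivcr floods the village every monsoon",  # simulated OCR error
    "a silver rivet holds each metal plate",
]
print(expand_query_term("river", docs))  # ['rivcr']
```

In this toy corpus, "rivet" is as close to "river" in spelling as "rivcr" is, but it shares no context words with "river", so only "rivcr" is kept. No clean version of the text is needed at any point, which is the setting the highlights emphasize.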

Keywords: Information Retrieval, OCR error, Word co-occurrence

Review history: Received 14 July 2015, Revised 8 March 2016, Accepted 24 March 2016, Available online 27 May 2016, Version of Record 22 July 2016.

Paper URL: https://doi.org/10.1016/j.ipm.2016.03.006