Improving Information Retrieval Performance on OCRed Text in the Absence of Clean Text Ground Truth

Highlights:

• The proposed algorithm uses context information to segregate semantically related error variants from the unrelated ones.

• String similarity measures are used to join error variants with the correct query word.

• The algorithm is tested on Bangla, Hindi and English datasets to show that the proposed approach is language-independent.

• The Bangla and Hindi datasets include clean, error-free versions for comparison, so we use performance on the clean text as the upper bound. We also compare our method with an error modelling approach which, unlike our method, uses the clean version.

• The English dataset is a genuine use case for our algorithm, as it has no error-free version.

• Our proposed method produces significant improvements over most of the baselines.

• We have also tested our proposed algorithm on the TREC-5 Confusion Track dataset and shown that it performs significantly better than the baselines.
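The first two highlights can be illustrated with a minimal sketch. The highlights do not name the string similarity measure or say how context is represented, so everything below is an assumption made for illustration: `difflib.SequenceMatcher` stands in for the string similarity measure, Jaccard overlap of co-occurring words stands in for the context comparison, and the function names, thresholds, and toy vocabulary are hypothetical rather than taken from the paper.

```python
from difflib import SequenceMatcher


def string_similarity(a: str, b: str) -> float:
    """Surface similarity in [0, 1]; a stand-in for the paper's measure."""
    return SequenceMatcher(None, a, b).ratio()


def context_overlap(ctx_a: set, ctx_b: set) -> float:
    """Jaccard overlap of two sets of co-occurring words (assumed context model)."""
    if not ctx_a or not ctx_b:
        return 0.0
    return len(ctx_a & ctx_b) / len(ctx_a | ctx_b)


def expand_query_word(word, vocab_contexts, sim_threshold=0.7, ctx_threshold=0.2):
    """Return vocabulary terms that look like OCR variants of `word`.

    A candidate is kept only if it is close to `word` as a string AND
    shares enough context with it. Both thresholds are hypothetical.
    """
    query_ctx = vocab_contexts.get(word, set())
    variants = []
    for candidate, ctx in vocab_contexts.items():
        if candidate == word:
            continue
        if string_similarity(word, candidate) < sim_threshold:
            continue  # not similar enough as a string
        if context_overlap(query_ctx, ctx) < ctx_threshold:
            continue  # similar string, but unrelated context
        variants.append(candidate)
    return sorted(variants)


# Toy vocabulary: each term maps to the words it co-occurs with.
toy_contexts = {
    "retrieval": {"information", "document", "query"},
    "retrieva1": {"information", "query", "system"},  # plausible OCR error
    "retrial": {"court", "judge", "verdict"},         # similar string, unrelated
    "banana": {"fruit", "yellow"},
}

variants = expand_query_word("retrieval", toy_contexts)
print(variants)
```

In this toy setting, "retrial" passes the string similarity check but is rejected because it shares no context with the query word, which is the kind of separation the first highlight describes. No clean text is consulted at any point.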
