Combination of HMMs for the representation of printed characters in noisy document images

Abstract

Many methods of printed character recognition have been proposed to date. Performance figures are usually stated for a particular set of fonts or text sizes, but it is rarely clear under what noise conditions the measurements were taken. Baird has suggested a model of document imaging defects, which lets authors compare results against an emerging standard in which a single figure quantifies the level of noise present in the document image. This paper proposes a novel method for the recognition of printed characters and outlines its extension to the segmentation and recognition of noisy printed words. The method represents the shape of a character by two Hidden Markov Models; recognition is achieved by scoring these models against the test pattern and combining the results. Evaluated using Baird's noise model, the method reaches a peak performance of 99.5% on the test set in the presence of near-minimal noise. It also generalizes to characters with noise levels greater than those included in the training set. An investigation of the top-k performance suggests that much of the effect of noise on the recognition of natural language text images could be overcome by a word recognizer employing shallow contextual knowledge.
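
The scoring-and-combining step described in the abstract can be illustrated with a minimal sketch. The abstract does not say what each of the two models observes or how their scores are combined, so everything specific here is an assumption: the observation sequences `obs_a` and `obs_b` stand for two hypothetical discrete feature sequences extracted from the same character image, the models are discrete HMMs scored with the standard forward algorithm, and the combination rule is a plain sum of log-likelihoods. The helper names (`log_likelihood`, `rank_classes`, `toy_hmm`) are illustrative only and do not come from the paper.

```python
import math


def safe_log(p):
    """Natural log that maps zero probability to -inf instead of raising."""
    return math.log(p) if p > 0 else -math.inf


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of log-values."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_likelihood(hmm, obs):
    """Forward algorithm in log space for a discrete HMM.

    hmm is a dict with 'start' (N), 'trans' (N x N) and 'emit' (N x M)
    probability tables; obs is a non-empty list of symbol indices.
    """
    n = len(hmm["start"])
    alpha = [
        safe_log(hmm["start"][i]) + safe_log(hmm["emit"][i][obs[0]])
        for i in range(n)
    ]
    for symbol in obs[1:]:
        alpha = [
            logsumexp([alpha[i] + safe_log(hmm["trans"][i][j]) for i in range(n)])
            + safe_log(hmm["emit"][j][symbol])
            for j in range(n)
        ]
    return logsumexp(alpha)


def rank_classes(models, obs_a, obs_b, k=3):
    """Return the k best (label, score) pairs, best first.

    models maps a character label to a pair (hmm_a, hmm_b).
    ASSUMPTION: the two scores are combined by summing log-likelihoods;
    the paper's actual combination rule is not given in the abstract.
    """
    scores = {
        label: log_likelihood(hmm_a, obs_a) + log_likelihood(hmm_b, obs_b)
        for label, (hmm_a, hmm_b) in models.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


def toy_hmm(p0):
    """Hypothetical two-state HMM whose states both emit symbol 0 with probability p0."""
    return {
        "start": [0.5, 0.5],
        "trans": [[0.5, 0.5], [0.5, 0.5]],
        "emit": [[p0, 1.0 - p0], [p0, 1.0 - p0]],
    }


# Toy character classes: "a" favours symbol 0, "b" favours symbol 1.
MODELS = {
    "a": (toy_hmm(0.9), toy_hmm(0.9)),
    "b": (toy_hmm(0.1), toy_hmm(0.1)),
}

if __name__ == "__main__":
    for label, score in rank_classes(MODELS, [0, 0, 0], [0, 0, 1], k=2):
        print(label, round(score, 3))
```

Returning a ranked list rather than a single label mirrors the top-k analysis mentioned in the abstract: a downstream word recognizer could use the lower-ranked candidates when context rules out the best-scoring character.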

Keywords: character recognition, Hidden Markov Models, shallow contextual knowledge

Article history: Received 28 July 1994, Revised 28 October 1994, Available online 16 December 1999.

Official paper URL: https://doi.org/10.1016/0262-8856(95)99725-G