Document retrieval from compressed images
Authors:
Abstract
With the emergence of digital libraries, more and more documents are stored and transmitted over the Internet as compressed images. It is therefore important to develop a system capable of retrieving documents directly from these compressed document images. This paper presents an approach for retrieving documents from images compressed with CCITT Group 4, a standard widely used for document image compression. The black and white changing elements are extracted directly from the compressed document images to serve as feature pixels, and the connected components are detected at the same time. Word bounding boxes are then formed by merging the connected components. A weighted Hausdorff distance is proposed, with which an unsupervised classifier assigns all word objects from both the query document and the database document to corresponding classes, while likely stop words are excluded. Document vectors are built from the occurrence frequencies of the word object classes, and the pairwise similarity of two document images is given by the scalar product of their document vectors. Nine groups of articles from different domains are used to test the validity of the approach. Preliminary experimental results on document images captured from students' theses show promising performance.
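The last step of the pipeline, building document vectors from word object class frequencies and comparing them with a scalar product, can be illustrated with a minimal sketch. It assumes the word objects have already been assigned integer class labels by the classifier; the function names `document_vector` and `scalar_product`, the label values, and the set of stop-word classes are all hypothetical and chosen only for illustration, not taken from the paper.

```python
from collections import Counter


def document_vector(class_labels, num_classes, stop_classes=frozenset()):
    """Count how often each word object class occurs in one document.

    class_labels: class index of every word object in the document
                  (assumed to come from an earlier clustering step).
    stop_classes: classes treated as likely stop words and skipped.
    """
    counts = Counter(c for c in class_labels if c not in stop_classes)
    return [counts.get(k, 0) for k in range(num_classes)]


def scalar_product(u, v):
    """Pairwise similarity of two documents as the dot product of their vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical class labels for a query document and a database document.
query_labels = [0, 2, 2, 3, 1, 2]
candidate_labels = [2, 3, 3, 1, 1, 0]
stop = {1}  # assumption: class 1 was flagged as a likely stop word

q = document_vector(query_labels, num_classes=4, stop_classes=stop)
d = document_vector(candidate_labels, num_classes=4, stop_classes=stop)
similarity = scalar_product(q, d)
print(q, d, similarity)
```

The sketch uses raw occurrence counts. Whether the vectors are normalized before the scalar product is taken is not stated in the abstract, so no normalization is applied here.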
Keywords: Document image retrieval, Compressed image, Object matching, Document similarity, Weighted Hausdorff distance
Review history: Received 27 November 2001, Accepted 18 April 2002, Available online 12 December 2002.
Paper URL: https://doi.org/10.1016/S0031-3203(02)00127-9