Summarization of Imaged Documents without OCR

Authors:

Highlights:

Abstract

A system is presented for creating a summary indicating the contents of an imaged document. The summary is composed from selected regions extracted from the imaged document. The regions may include sentences, key phrases, headings, and figures. The extracts are identified without the use of optical character recognition. The imaged document is first processed to identify the word-bounding boxes, the reading order of words, and the location of sentence and paragraph boundaries in the text. The word-bounding boxes are grouped into equivalence classes to mimic the terms in a text document. Equivalence classes representing content words are identified, and key phrases are identified from the set of content words. Summary sentences are selected using a statistically based classifier applied to a set of discrete sentence features. Sentence selection is evaluated against a set of abstracts created by a professional abstracting company.
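
The pipeline described in the abstract (group word boxes into equivalence classes, pick the classes that behave like content words, then select sentences) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the `WordBox` fields, the width/height tolerance used in place of real word-image matching, the `min_width` and `min_count` thresholds for content words, and the simple count-based ranking that stands in for the statistically based classifier.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class WordBox:
    """A word-bounding box in reading order (hypothetical fields)."""
    sentence: int  # index of the sentence the word belongs to
    width: int     # box width in pixels
    height: int    # box height in pixels


def group_into_classes(boxes, tol=2):
    """Greedily assign each box to the first class whose representative
    is within `tol` pixels in both width and height; otherwise start a
    new class. Box size is a crude stand-in for word-image matching."""
    representatives = []  # one representative box per class
    labels = []           # class id for each input box
    for box in boxes:
        for class_id, rep in enumerate(representatives):
            if (abs(box.width - rep.width) <= tol
                    and abs(box.height - rep.height) <= tol):
                labels.append(class_id)
                break
        else:
            representatives.append(box)
            labels.append(len(representatives) - 1)
    return labels, representatives


def content_classes(labels, representatives, min_width=40, min_count=2):
    """Treat classes that recur and are reasonably wide as content words
    (assumed heuristic: short, narrow words are mostly function words)."""
    counts = Counter(labels)
    return {
        class_id
        for class_id, n in counts.items()
        if n >= min_count and representatives[class_id].width >= min_width
    }


def rank_sentences(boxes, labels, content):
    """Rank sentence indices by how many content-class words they contain.
    This count is a stand-in for a trained classifier over sentence features."""
    scores = Counter()
    for box, label in zip(boxes, labels):
        scores[box.sentence] += 1 if label in content else 0
    return sorted(scores, key=lambda s: (-scores[s], s))
```

A caller would pass the boxes in reading order, for example `labels, reps = group_into_classes(boxes)` followed by `rank_sentences(boxes, labels, content_classes(labels, reps))`, and take the top-ranked sentence indices as the extract. No character is ever recognized: only box geometry and class membership are used, which is the point the abstract makes about working without OCR.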

Keywords:

Review history: Received 11 February 1997, Accepted 21 December 1997, Available online 10 April 2002.

Official paper URL: https://doi.org/10.1006/cviu.1998.0688