Information extraction from scanned invoice images using text analysis and layout features

Authors:

Highlights:

• Invoice information extraction is an inevitable task in bulk document processing.

• The best current systems rely on predefined invoice templates, which limits their flexibility.

• OCRMiner uses content and layout processing techniques inspired by the way humans read invoices.

• The system is prepared and evaluated in a multilingual environment.

• The training process uses a very small development set of a few invoices.

• OCRMiner reaches accuracy comparable to systems trained on huge curated datasets.

Keywords: OCR, Information extraction, Scanned documents, Document metadata, Invoice metadata extraction, Metadata indexing

Review history: Received 19 January 2021, Revised 1 December 2021, Accepted 11 December 2021, Available online 16 December 2021, Version of Record 28 December 2021.

Paper URL: https://doi.org/10.1016/j.image.2021.116601