TEXUS: A unified framework for extracting and understanding tables in PDF documents

Authors:

Highlights:

Abstract

Tables in documents are a widely available and rich source of information, but they are not yet well utilised computationally because of the difficulty of automatically extracting their structure and data content. A plethora of systems has been proposed to solve the problem, but current methods suffer from low usability and accuracy, and lack precision in detecting data across diverse layouts. We propose a component-based design and implementation of table processing concepts that offers flexibility and reusability as well as high performance on a wide range of table types. In this paper, we describe TEXUS, a fully automated table processing system that takes a PDF document and detects tables in a layout-independent manner. We introduce TEXUS’s own document model, designed specifically for table processing, and its two-phased processing pipeline design. Through an extensive evaluation on a dataset composed of complex financial tables, we show the performance of the system on different table types.

Keywords: Table processing, Table extraction, Table understanding, Automatic table extraction, Table detection

Review history: Received 9 March 2018, Revised 5 November 2018, Accepted 21 January 2019, Available online 14 February 2019, Version of Record 14 February 2019.

Official paper URL: https://doi.org/10.1016/j.ipm.2019.01.008