Handwritten Chinese text line segmentation by clustering with distance metric learning

Authors:

Highlights:

Abstract

Separating text lines in unconstrained handwritten documents remains a challenge because handwritten text lines are often non-uniformly skewed and curved, and the gaps between lines are not always clear. In this paper, we propose a novel text line segmentation algorithm based on minimal spanning tree (MST) clustering with distance metric learning. Given a distance metric, the connected components (CCs) of the document image are grouped into a tree structure, from which text lines are extracted by dynamically cutting edges using a new hypervolume reduction criterion and a straightness measure. Because the distance metric is learned in a supervised manner from a dataset of CC pairs, the proposed algorithm is robust across documents with multi-skewed and curved text lines. In experiments on a database of 803 unconstrained handwritten Chinese document images containing a total of 8,169 lines, the proposed algorithm achieved a line detection correct rate of 98.02% and compared favorably with other competitive algorithms.
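The MST clustering idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the hand-set weighted distance stands in for the learned metric, and a fixed edge-length threshold (`max_edge`) stands in for the hypervolume reduction criterion and straightness measure. All function names, weights, and centroid coordinates below are hypothetical.

```python
import math
from itertools import combinations


def weighted_distance(a, b, wx=1.0, wy=3.0):
    """Hand-set stand-in for a learned metric: vertical gaps cost more."""
    return math.hypot(wx * (a[0] - b[0]), wy * (a[1] - b[1]))


def find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def minimum_spanning_tree(points, dist):
    """Kruskal's algorithm; returns MST edges as (weight, i, j)."""
    parent = list(range(len(points)))
    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    tree = []
    for w, i, j in edges:
        ri, rj = find(parent, i), find(parent, j)
        if ri != rj:
            parent[ri] = rj
            tree.append((w, i, j))
    return tree


def cut_into_lines(points, tree, max_edge):
    """Drop MST edges longer than max_edge; return groups of point indices."""
    parent = list(range(len(points)))
    for w, i, j in tree:
        if w <= max_edge:
            parent[find(parent, i)] = find(parent, j)
    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(parent, i), []).append(i)
    # Order the groups top to bottom by the mean y of their members.
    return sorted(
        groups.values(),
        key=lambda g: sum(points[i][1] for i in g) / len(g),
    )


# Hypothetical CC centroids (x, y): two slightly skewed text lines.
centroids = [(0, 0), (10, 1), (20, 2), (30, 3),
             (0, 20), (10, 21), (20, 22), (30, 23)]
tree = minimum_spanning_tree(centroids, weighted_distance)
lines = cut_into_lines(centroids, tree, max_edge=15.0)
print(lines)
```

With these toy centroids, the single long edge joining the two rows is cut, leaving one group per text line. A fixed threshold like this would likely fail on strongly curved or touching lines, which is the situation the learned metric and dynamic cutting criterion in the paper are meant to handle.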

Keywords: Handwritten text line segmentation, Clustering, Minimal spanning tree (MST), Distance metric learning, Hypervolume reduction

Article history: Received 7 August 2008, Revised 21 November 2008, Accepted 20 December 2008, Available online 4 January 2009.

Official paper URL: https://doi.org/10.1016/j.patcog.2008.12.013