Arbitrarily shaped scene text detection with dynamic convolution

Authors:

Highlights:

• We dynamically generate the convolutional kernels for each text instance from multiple features, according to the detailed characteristics of that instance. Instance-specific attributes such as position, scale, and center are embedded into the kernel, so that mask prediction with the text-instance-aware kernel focuses on the pixels belonging to that instance. This design helps improve detection accuracy for adjacent text instances.

• We generate a separate mask prediction head for each instance, in parallel. These heads predict masks on the original feature map and so retain the resolution details of the text instance. It is no longer necessary to crop the RoIs and force them to the same size. Our architecture overcomes the problem that a fixed set of convolution kernels cannot adapt to all resolutions, while also preventing the information loss caused by the varying scales of the instances.

• Because the text-instance-aware convolutional kernel increases the capacity of the model, we can achieve competitive results with a very compact prediction head. Multiple mask prediction heads can therefore run concurrently without significant computational overhead.

• To improve performance and accelerate training convergence, we design a text-shape-sensitive position embedding that explicitly provides location information to the mask prediction head.
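
The highlights above describe compact per-instance mask heads whose kernels are generated dynamically, which run on the full shared feature map without RoI cropping, and which receive explicit location information. The following is a minimal, framework-free sketch of that general idea, not the paper's implementation. It rests on several assumptions: the head is a stack of 1x1 convolutions, the parameter vectors (`theta`) are random stand-ins for the output of a kernel-generating branch that is not specified here, and the position input is a plain relative-coordinate map rather than the text-shape-sensitive embedding. All function names and sizes are hypothetical.

```python
import math
import random


def relative_coords(height, width, center):
    """Two coordinate channels: (dy, dx) offsets of every pixel from an instance center."""
    cy, cx = center
    dy = [[(y - cy) / height for _ in range(width)] for y in range(height)]
    dx = [[(x - cx) / width for x in range(width)] for _ in range(height)]
    return [dy, dx]


def conv1x1(features, weights, biases):
    """1x1 convolution. features: C_in x H x W, weights: C_out x C_in, biases: C_out."""
    height, width = len(features[0]), len(features[0][0])
    return [
        [[b + sum(w * ch[y][x] for w, ch in zip(row, features)) for x in range(width)]
         for y in range(height)]
        for row, b in zip(weights, biases)
    ]


def split_params(theta, channels):
    """Slice a flat parameter vector into per-layer (weights, biases) pairs."""
    layers, i = [], 0
    for c_in, c_out in zip(channels[:-1], channels[1:]):
        weights = [theta[i + r * c_in: i + (r + 1) * c_in] for r in range(c_out)]
        i += c_in * c_out
        biases = theta[i: i + c_out]
        i += c_out
        layers.append((weights, biases))
    if i != len(theta):
        raise ValueError("theta length does not match the head layout")
    return layers


def num_params(channels):
    """Number of weights plus biases in a stack of 1x1 convolutions."""
    return sum(c_in * c_out + c_out for c_in, c_out in zip(channels[:-1], channels[1:]))


def dynamic_mask_head(shared, theta, center, hidden=4):
    """Predict one instance mask on the full feature map using instance-specific kernels."""
    height, width = len(shared[0]), len(shared[0][0])
    # Append instance-relative coordinates so the head knows where its instance is.
    x = shared + relative_coords(height, width, center)
    layers = split_params(theta, [len(x), hidden, 1])
    for idx, (weights, biases) in enumerate(layers):
        x = conv1x1(x, weights, biases)
        if idx < len(layers) - 1:  # ReLU on every layer except the last
            x = [[[max(0.0, v) for v in row] for row in ch] for ch in x]
    # Sigmoid turns the single output channel into per-pixel mask probabilities.
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in x[0]]


rng = random.Random(0)
C, H, W, HIDDEN = 3, 4, 6, 4  # toy sizes, chosen only for illustration
shared = [[[rng.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
n = num_params([C + 2, HIDDEN, 1])  # (5*4 + 4) + (4*1 + 1) parameters per instance

# Random stand-ins for the per-instance kernels a generating branch would produce.
instances = [
    ((1.0, 1.5), [rng.gauss(0.0, 0.5) for _ in range(n)]),
    ((2.5, 4.0), [rng.gauss(0.0, 0.5) for _ in range(n)]),
]
masks = [dynamic_mask_head(shared, theta, center, HIDDEN) for center, theta in instances]
```

Each instance reuses the same shared feature map and differs only in its small parameter vector and coordinate channels, which is why many such heads can be evaluated side by side at low cost. In a deep learning framework, the per-instance heads would typically be batched into a single grouped convolution rather than looped over.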


Keywords: Scene text detection, Image segmentation, Arbitrary shape, Dynamic convolution

Article history: Received 12 April 2021, Revised 10 January 2022, Accepted 22 February 2022, Available online 23 February 2022, Version of Record 28 February 2022.

Official paper URL: https://doi.org/10.1016/j.patcog.2022.108608