Spatiotemporal Fusion Networks for Video Action Recognition

作者:Zheng Liu, Haifeng Hu, Junxuan Zhang

摘要

Learning spatiotemporal information is a fundamental part in action recognition. In this work, we attempt to extract efficient spatiotemporal information for video representation through a novel architecture, termed as SpatioTemporal Fusion Networks (STFN). STFN extract spatiotemporal information by introducing connections between the spatial and temporal streams in two-stream networks with fusion blocks, called as Compactly Fuse Spatial and Temporal information (CFST) block, whose goal is to integrate spatial and temporal information with little computational cost. CFST is built upon Compact Bilinear Pooling which can capture multiplicative interactions at corresponding locations. For better integration of two streams, we make an exploration of fusion configuration about where to insert fusion block and a combination of CFST block and additive interaction. We evaluate our proposed architecture on UCF-101 and HMDB-51, and obtain a comparable performance.

论文关键词:Action recognition, Convolutional neural networks, Compact bilinear pooling

论文评审过程:

论文官网地址:https://doi.org/10.1007/s11063-018-09972-6