Linear dynamical systems approach for human action recognition with dual-stream deep features

摘要

Human action recognition with a dual-stream architecture using linear dynamical systems (LDSs) approach is discussed in this paper. First, a slice process is established to extract original slices from video sequences. Two slicing methods are adopted to subtract or reserve the remaining frames in the video sequences. By applying background subtraction to adjacent frames of the original slices, difference slices are also expressed. To capture the spatial component of the background and difference expressed in each slice simultaneously, a framework based on pre-trained convolutional neural networks (CNNs) is introduced for dual-stream deep feature extraction. Subsequently, LDSs are established to model the timing relationship between adjacent slices and obtain the temporal component of the background and difference features, which are expressed as linear dynamical background feature (LD-BF) and linear dynamical difference feature (LD-DF). Practical experiments were conducted to demonstrate the effectiveness and robustness of the proposed approach using different datasets. Specifically, our experiments were conducted on the UCF50, UCF101, and hmdb51 datasets. The impact of retaining various principal component analysis (PCA) feature dimensions and distinct slicing methods in terms of detail recognition were evaluated. In particular, combining LD-BF with LD-DF under appropriate feature dimensions and slicing methods further improved the accuracy for the UCF50, UCF101, and hmdb51 datasets. In addition, the computational cost of the feature extraction process was evaluated to illustrate the efficiency of the proposed approach. The experimental results show that the proposed approach is competitive with state-of-the-art approaches in the three datasets.