ATDA: Attentional temporal dynamic activation for speech emotion recognition
Abstract:
Speech emotion recognition (SER) plays a vital role in intelligent human–computer interaction (HCI). The Convolutional Neural Network (CNN) is widely used in SER; it effectively captures static local features but ignores the temporal dynamic features that SER requires. To address this problem, we insert an Attentional Temporal Dynamic Activation (ATDA) module into a CNN-based model, enabling it to learn static and dynamic features simultaneously. Specifically, the ATDA module comprises a Temporal Dynamic Activation (TDA) block followed by a Multi-view and Multi-granularity Attention (MMA) block. The TDA block computes temporal differences at the feature level to activate dynamic information and generate fundamental dynamic features. The MMA block then detects and amplifies emotion-related dynamic features using multiple attention views and granularities. Together, these two blocks activate and extract dynamic emotional features. Meanwhile, static features are obtained by a convolutional layer and combined with the dynamic features to form the final emotional representations. Experiments on the IEMOCAP, MSP-IMPROV, and MELD datasets show that the proposed ATDA-CNN model achieves competitive results and improves SER accuracy by learning meaningful emotional representations.
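The core TDA idea described in the abstract, activating dynamic information by differencing consecutive frames at the feature level, can be sketched as below. This is a minimal illustration, not the paper's implementation: the `(time, dim)` feature layout, the zero-padding of the first frame, and the function name are all assumptions.

```python
import numpy as np

def temporal_difference(features):
    """Sketch of feature-level temporal differencing (TDA-style).

    features: (time, dim) array of frame-level features.
    Returns an array of the same shape whose t-th row is the
    frame-to-frame change features[t] - features[t-1]; the first
    row is zero-padded to preserve the sequence length (an
    assumption for illustration).
    """
    diff = np.diff(features, axis=0)                      # (time-1, dim) frame deltas
    pad = np.zeros((1, features.shape[1]), dtype=features.dtype)
    return np.concatenate([pad, diff], axis=0)            # back to (time, dim)

# Toy example: 4 frames of 3-dimensional features
x = np.array([[0., 1., 2.],
              [1., 1., 0.],
              [3., 1., 0.],
              [3., 0., 5.]])
d = temporal_difference(x)
print(d.shape)  # (4, 3)
```

In the full ATDA module these dynamic features would then pass through the MMA attention block and be fused with the static features from a convolutional layer.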
Keywords: Speech emotion recognition, Temporal dynamic activation, Multi-view and multi-granularity attention, Experimental analysis
History: Received 5 November 2021, Revised 12 February 2022, Accepted 16 February 2022, Available online 24 February 2022, Version of Record 3 March 2022.
DOI: https://doi.org/10.1016/j.knosys.2022.108472