BLSTM and CNN Stacking Architecture for Speech Emotion Recognition

Authors: Dongdong Li, Linyu Sun, Xinlei Xu, Zhe Wang, Jing Zhang, Wenli Du

Abstract

Speech Emotion Recognition (SER) is the challenging task of distinguishing and interpreting the emotions carried in speech. Fortunately, deep learning has proved highly capable of handling acoustic features. For instance, Bidirectional Long Short-Term Memory (BLSTM) is well suited to modeling time-series acoustic features, and a Convolutional Neural Network (CNN) can discover the local structure among different features. This paper proposes the BLSTM and CNN Stacking Architecture (BCSA) to improve emotion recognition. The feature matrices are sliced to match the input formats of the BLSTM and the CNN. To exploit the different roles of the two networks, Stacking is used to integrate them. Specifically, to guard against overfitting, the class probability estimates from the BLSTM and the CNN are combined into new data using K-fold cross-validation. Finally, a logistic regression model is fitted to this new data to recognize emotions. The experimental results demonstrate that the proposed architecture performs better than a single model. Furthermore, compared with the state-of-the-art SER model known to the authors, BCSA may be more suitable for SER because it integrates time-series acoustic features with the local structure among different features.
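The Stacking step described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: `CentroidModel` is a hypothetical stand-in for the BLSTM and CNN base learners, the data are synthetic, K = 5 is an arbitrary choice, and the logistic regression is a plain gradient-descent version written for this example. What it does show is the general pattern: each base model produces out-of-fold class probabilities via K-fold cross-validation, the probabilities are concatenated into new data, and a logistic regression is fitted to that data.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class CentroidModel:
    """Hypothetical stand-in for a base learner (BLSTM or CNN in the paper).

    Scores each class by negative squared distance to the class centroid,
    using only the feature columns listed in `cols`.
    """

    def __init__(self, cols, n_classes):
        self.cols = cols
        self.n_classes = n_classes

    def fit(self, X, y):
        self.centroids = []
        for c in range(self.n_classes):
            rows = [[x[j] for j in self.cols] for x, t in zip(X, y) if t == c]
            self.centroids.append([sum(col) / len(rows) for col in zip(*rows)])
        return self

    def predict_proba(self, X):
        probs = []
        for x in X:
            view = [x[j] for j in self.cols]
            scores = [-sum((a - b) ** 2 for a, b in zip(view, cen))
                      for cen in self.centroids]
            probs.append(softmax(scores))
        return probs


def out_of_fold_probs(make_model, X, y, k, seed=0):
    """Predict each sample with a model that never saw it during training."""
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    oof = [None] * len(X)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in order if i not in held]
        model = make_model().fit([X[i] for i in train], [y[i] for i in train])
        held_probs = model.predict_proba([X[i] for i in held_out])
        for i, p in zip(held_out, held_probs):
            oof[i] = p
    return oof


def predict_one(W, z):
    """Class probabilities for one meta-feature vector (bias appended)."""
    return softmax([sum(w * v for w, v in zip(row, z + [1.0])) for row in W])


def fit_logistic_regression(Z, y, n_classes, lr=0.5, epochs=300):
    """Multinomial logistic regression trained with batch gradient descent."""
    dim = len(Z[0])
    W = [[0.0] * (dim + 1) for _ in range(n_classes)]  # last weight is the bias
    for _ in range(epochs):
        grad = [[0.0] * (dim + 1) for _ in range(n_classes)]
        for z, t in zip(Z, y):
            p = predict_one(W, z)
            for c in range(n_classes):
                err = p[c] - (1.0 if c == t else 0.0)
                for j, v in enumerate(z + [1.0]):
                    grad[c][j] += err * v
        for c in range(n_classes):
            for j in range(dim + 1):
                W[c][j] -= lr * grad[c][j] / len(Z)
    return W


# Synthetic two-class data (an assumption; the paper uses acoustic features).
rng = random.Random(42)
n_classes = 2
X, y = [], []
for label in range(n_classes):
    for _ in range(30):
        X.append([rng.gauss(2.0 * label, 0.5) for _ in range(4)])
        y.append(label)

# Each base model sees a different view of the features.
oof_a = out_of_fold_probs(lambda: CentroidModel([0, 1], n_classes), X, y, k=5)
oof_b = out_of_fold_probs(lambda: CentroidModel([2, 3], n_classes), X, y, k=5)

# The concatenated probabilities are the "new data" for the meta-learner.
meta_features = [a + b for a, b in zip(oof_a, oof_b)]
W = fit_logistic_regression(meta_features, y, n_classes)

predictions = []
for z in meta_features:
    p = predict_one(W, z)
    predictions.append(p.index(max(p)))
accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)
```

Using out-of-fold predictions rather than predictions on the training samples themselves is what limits overfitting here: the meta-learner only sees probabilities that the base models produced for data they were not trained on.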

Keywords: Speech emotion recognition, Convolutional neural network, Bidirectional long short term memory, Stacking

Official paper URL: https://doi.org/10.1007/s11063-021-10581-z