Transformer-Based Interactive Multi-Modal Attention Network for Video Sentiment Detection
Authors: Xuqiang Zhuang, Fangai Liu, Jian Hou, Jianhua Hao, Xiaohong Cai
Abstract
Social media allows users to express opinions in multiple modalities such as text, pictures, and short videos. Multi-modal sentiment detection can more effectively predict the emotional tendencies expressed by users and has therefore received extensive attention in recent years. However, current works treat the utterances of a video as independent modalities, ignoring the effective interaction among the different modalities of a video. To tackle these challenges, we propose a transformer-based interactive multi-modal attention network that investigates multi-modal paired attention between modalities and utterances for video sentiment detection. Specifically, we first take a series of utterances as input and use three separate transformer encoders to capture the utterance-level features of each modality. Subsequently, we introduce a multi-modal paired attention mechanism to learn the cross-modality information between modalities and utterances. Finally, we inject the cross-modality information into a multi-headed self-attention layer to make the final emotion and sentiment classification. Our solution outperforms baseline models on three multi-modal datasets.
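To make the pipeline concrete, the sketch below mirrors the three stages the abstract describes: per-modality transformer encoders, paired cross-modal attention, and a fusion self-attention layer feeding a classifier. It is a minimal illustration under assumed shapes and hyper-parameters; the class names (`TIMANet`, `PairedAttention`), the six ordered modality pairs, and the sum-based fusion are our assumptions, not the authors' released code.

```python
# Minimal PyTorch sketch of the three-stage pipeline described above.
# Module names, feature dimensions, layer counts, and the sum-based
# fusion of modality pairs are illustrative assumptions.
import torch
import torch.nn as nn


class PairedAttention(nn.Module):
    """Cross-modality attention: one modality queries another."""

    def __init__(self, dim, heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query_mod, key_mod):
        # query_mod, key_mod: (batch, num_utterances, dim)
        out, _ = self.attn(query_mod, key_mod, key_mod)
        return out


class TIMANet(nn.Module):
    def __init__(self, dim=128, heads=4, enc_layers=2, num_classes=3):
        super().__init__()

        def make_encoder():
            return nn.TransformerEncoder(
                nn.TransformerEncoderLayer(dim, heads, batch_first=True),
                num_layers=enc_layers,
            )

        # Step 1: one transformer encoder per modality (text, audio, video).
        self.text_enc = make_encoder()
        self.audio_enc = make_encoder()
        self.video_enc = make_encoder()
        # Step 2: paired attention for each ordered modality pair (6 pairs).
        self.pairs = nn.ModuleList([PairedAttention(dim, heads) for _ in range(6)])
        # Step 3: inject the cross-modal features into a multi-headed
        # self-attention layer, then classify each utterance.
        self.fusion = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, text, audio, video):
        # Each input: (batch, num_utterances, dim) utterance-level features.
        t = self.text_enc(text)
        a = self.audio_enc(audio)
        v = self.video_enc(video)
        # Cross-modality information from every ordered pair of modalities,
        # fused here by simple summation (an assumption for this sketch).
        combos = [(t, a), (t, v), (a, t), (a, v), (v, t), (v, a)]
        cross = sum(p(q, k) for p, (q, k) in zip(self.pairs, combos))
        fused, _ = self.fusion(cross, cross, cross)
        return self.classifier(fused)  # per-utterance sentiment logits


model = TIMANet()
text = audio = video = torch.randn(2, 10, 128)  # toy utterance features
logits = model(text, audio, video)              # shape: (2, 10, 3)
```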
Keywords: Multimodal, Transformer, Sentiment detection
Paper link: https://doi.org/10.1007/s11063-021-10713-5