Attention guided deep audio-face fusion for efficient speaker naming

Authors:

Abstract

Speaker naming, i.e., identifying the active speaking character in a movie video, has recently received considerable attention, and the face cue alone is generally insufficient for reliable performance due to its significant appearance variations. In this paper, we treat the speaker naming task as a group of matched audio-face pair finding problems and present an efficient attention-guided deep audio-face fusion approach to detect the active speakers. First, we encode face images with VGG and extract Mel-Frequency Cepstrum Coefficients (MFCCs) from the audio signals. Then, two efficient audio encoding modules, namely a two-layer Long Short-Term Memory (LSTM) encoder and a two-dimensional convolution encoder, are designed to extract discriminative high-level audio features. Meanwhile, we train an end-to-end audio-face common attention model that produces a face attention vector adapting to various face variations. Further, an efficient factorized bilinear model is presented to deeply fuse the paired audio-face features, whereby a reliable joint audio-face representation can be obtained for speaker naming. Extensive experiments highlight the superiority of the proposed approach and show that its performance is highly competitive with the state of the art.
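
To make the pipeline described in the abstract concrete, below is a minimal PyTorch sketch of its three stages: an LSTM encoder over MFCC frames, an audio-face common attention module that weights per-frame face features, and factorized bilinear fusion of the paired vectors. All module names, dimensions, and design details (the additive attention form, the sum-pooling window, the power/L2 normalization) are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioFaceCommonAttention(nn.Module):
    """Scores each face frame by its affinity with the audio embedding and
    returns an attention-weighted face vector (an assumed formulation)."""

    def __init__(self, face_dim, audio_dim, hidden_dim=256):
        super().__init__()
        self.face_proj = nn.Linear(face_dim, hidden_dim)
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, faces, audio):
        # faces: (batch, n_frames, face_dim); audio: (batch, audio_dim)
        joint = torch.tanh(self.face_proj(faces) + self.audio_proj(audio).unsqueeze(1))
        weights = torch.softmax(self.score(joint).squeeze(-1), dim=1)  # (batch, n_frames)
        return (weights.unsqueeze(-1) * faces).sum(dim=1)  # attended face vector


class FactorizedBilinearFusion(nn.Module):
    """Factorized bilinear pooling of a paired face/audio vector:
    z = SumPool(Ux * Vy), followed by power and L2 normalization."""

    def __init__(self, face_dim, audio_dim, factor_dim=1024, out_dim=128):
        super().__init__()
        assert factor_dim % out_dim == 0
        self.k = factor_dim // out_dim  # sum-pooling window size
        self.U = nn.Linear(face_dim, factor_dim, bias=False)
        self.V = nn.Linear(audio_dim, factor_dim, bias=False)

    def forward(self, face, audio):
        joint = self.U(face) * self.V(audio)  # low-rank factorized bilinear product
        joint = joint.view(joint.size(0), -1, self.k).sum(dim=2)  # sum pooling
        joint = torch.sign(joint) * torch.sqrt(joint.abs() + 1e-12)  # power norm
        return F.normalize(joint, dim=1)  # L2 normalization


class SpeakerNamer(nn.Module):
    """End-to-end sketch: two-layer LSTM over MFCC frames -> audio vector;
    attention over VGG face features -> face vector; fusion -> match score."""

    def __init__(self, mfcc_dim=13, face_dim=4096, audio_dim=256):
        super().__init__()
        self.audio_enc = nn.LSTM(mfcc_dim, audio_dim, num_layers=2, batch_first=True)
        self.attention = AudioFaceCommonAttention(face_dim, audio_dim)
        self.fusion = FactorizedBilinearFusion(face_dim, audio_dim)
        self.classifier = nn.Linear(128, 1)  # matched vs. unmatched audio-face pair

    def forward(self, mfcc, faces):
        _, (h, _) = self.audio_enc(mfcc)  # mfcc: (batch, T, mfcc_dim)
        audio = h[-1]                      # final hidden state of the top LSTM layer
        face = self.attention(faces, audio)
        return self.classifier(self.fusion(face, audio)).squeeze(-1)


# Toy usage: a batch of 2 clips with 50 MFCC frames and 8 face frames each.
model = SpeakerNamer()
scores = model(torch.randn(2, 50, 13), torch.randn(2, 8, 4096))
print(scores.shape)  # torch.Size([2])

The 4096-dimensional face input assumes VGG fc7 features and the 13-dimensional audio input assumes standard MFCCs; both are placeholders and not specified by the abstract.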

Keywords: Speaker naming, Deep audio-face fusion, Common attention model, Factorized bilinear model

Article history: Received 21 November 2017, Revised 24 October 2018, Accepted 15 December 2018, Available online 18 December 2018, Version of Record 21 December 2018.

DOI: https://doi.org/10.1016/j.patcog.2018.12.011