Synthesizing Talking Faces from Text and Audio: An Autoencoder and Sequence-to-Sequence Convolutional Neural Network
Authors:
Highlights:
• An effective landmark localization pipeline, based on landmark detection, optical flow estimation, and a Kalman filter, is proposed to avoid face shake.
• A part-based autoencoder is introduced to learn low-dimensional representations of different face regions.
• A sequence-to-sequence convolutional neural network with residual units is proposed to learn the mapping from phonemes to facial codes.
• Tests on two public audio-visual datasets and a new dataset, Chinese CCTV News, demonstrate the effectiveness of the proposed method against other state-of-the-art methods.
Keywords: Convolutional neural network, Autoencoder, Regression, Face landmark, Face tracking, Lip sync, Video, Audio
Review history: Received 30 May 2019, Revised 26 December 2019, Accepted 23 January 2020, Available online 24 January 2020, Version of Record 7 February 2020.
Official page: https://doi.org/10.1016/j.patcog.2020.107231