Synthesizing Talking Faces from Text and Audio: An Autoencoder and Sequence-to-Sequence Convolutional Neural Network

Authors:

Highlights:

• An effective landmark localization pipeline, based on landmark detection, optical flow estimation, and Kalman filtering, is proposed to avoid face shake.

• A part-based autoencoder is introduced to learn low-dimensional representations of different face regions.

• A sequence-to-sequence convolutional neural network with residual units is proposed to learn the mapping from phonemes to facial codes.

• The method is tested on two public audio-visual datasets and a new dataset called Chinese CCTV News; the results demonstrate the effectiveness of the proposed method against other state-of-the-art methods.
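
To make the first highlight more concrete, the sketch below illustrates only the Kalman filtering idea: smoothing per-frame landmark coordinates so that detection noise does not show up as face shake. It is not the authors' pipeline, which also combines landmark detection and optical flow estimation. The random-walk motion model, the function names `kalman_smooth` and `smooth_landmarks`, and the values of `process_var` and `meas_var` are all assumptions chosen for illustration.

```python
def kalman_smooth(measurements, process_var=1e-3, meas_var=1.0):
    """Filter a 1-D coordinate track with a scalar Kalman filter.

    Assumed model (not from the paper): the true coordinate follows a
    random walk with variance process_var per frame, and each detection
    is the true coordinate plus noise of variance meas_var.
    """
    if not measurements:
        return []
    x = float(measurements[0])  # state estimate, initialised from frame 0
    p = meas_var                # variance of the state estimate
    smoothed = [x]
    for z in measurements[1:]:
        p = p + process_var     # predict: uncertainty grows between frames
        k = p / (p + meas_var)  # Kalman gain: how much to trust the detection
        x = x + k * (z - x)     # update: move the estimate toward the detection
        p = (1.0 - k) * p       # uncertainty shrinks after the update
        smoothed.append(x)
    return smoothed


def smooth_landmarks(frames, process_var=1e-3, meas_var=1.0):
    """Smooth every landmark independently along the time axis.

    frames: list of frames, each a list of (x, y) tuples with the same
    number of landmarks in every frame (hypothetical input layout).
    """
    if not frames:
        return []
    n_points = len(frames[0])
    xs = [kalman_smooth([f[i][0] for f in frames], process_var, meas_var)
          for i in range(n_points)]
    ys = [kalman_smooth([f[i][1] for f in frames], process_var, meas_var)
          for i in range(n_points)]
    return [[(xs[i][t], ys[i][t]) for i in range(n_points)]
            for t in range(len(frames))]
```

A larger `meas_var` relative to `process_var` gives a steadier but slower-reacting track; a real system would likely tune these per dataset and might use a constant-velocity state instead of a random walk.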

Keywords: Convolutional neural network, Autoencoder, Regression, Face landmark, Face tracking, Lip sync, Video, Audio

Review history: Received 30 May 2019, Revised 26 December 2019, Accepted 23 January 2020, Available online 24 January 2020, Version of Record 7 February 2020.

Official paper URL: https://doi.org/10.1016/j.patcog.2020.107231