E2E-V2SResNet: Deep residual convolutional neural networks for end-to-end video driven speech synthesis
Authors:
Highlights:
• An end-to-end ResNet (E2E-ResNet) model for synthesizing speech signals from silent video is proposed.
• Two techniques are proposed to synthesize a speech waveform from Mel-scale and linear-scale features.
• A fast Griffin-Lim algorithm for spectrogram inversion is proposed to synthesize intelligible speech of acceptable quality.
• A 3D-CNN-based waveform CRITIC is proposed to differentiate real and synthesized speech waveforms.
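The fast Griffin-Lim step named in the highlights recovers a time-domain waveform from a magnitude spectrogram by iteratively re-estimating phase, accelerated with a momentum term. The sketch below is a minimal, hypothetical illustration of that general technique using SciPy's STFT routines; the paper's exact window sizes, iteration counts, and momentum value are not given here, so the parameters are assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def fast_griffin_lim(mag, n_iter=32, momentum=0.99, nperseg=512, noverlap=384):
    """Reconstruct a waveform from a magnitude spectrogram with
    momentum-accelerated (fast) Griffin-Lim phase estimation.

    mag: magnitude spectrogram, shape (n_freq, n_frames).
    Parameters are illustrative defaults, not the paper's settings.
    """
    rng = np.random.default_rng(0)
    # Start from a random phase estimate.
    angles = np.exp(2j * np.pi * rng.random(mag.shape))
    t_prev = np.zeros_like(mag, dtype=complex)
    for _ in range(n_iter):
        # Project to the time domain and back (STFT consistency step).
        _, x = istft(mag * angles, nperseg=nperseg, noverlap=noverlap)
        _, _, t = stft(x, nperseg=nperseg, noverlap=noverlap)
        # Keep the frame count aligned with the target spectrogram.
        if t.shape[1] < mag.shape[1]:
            t = np.pad(t, ((0, 0), (0, mag.shape[1] - t.shape[1])))
        else:
            t = t[:, :mag.shape[1]]
        # Momentum step: the "fast" acceleration over plain Griffin-Lim.
        update = t + momentum * (t - t_prev)
        angles = update / np.maximum(np.abs(update), 1e-16)
        t_prev = t
    _, x = istft(mag * angles, nperseg=nperseg, noverlap=noverlap)
    return x
```

With `momentum=0`, this reduces to the classic Griffin-Lim iteration; the momentum term typically reaches a comparable reconstruction in far fewer iterations.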
Keywords: Video processing, E2E speech synthesis, ResNet-18, Residual CNN, Waveform CRITIC
Article history: Received 27 October 2021, Revised 5 January 2022, Accepted 16 January 2022, Available online 31 January 2022, Version of Record 10 February 2022.
Paper link: https://doi.org/10.1016/j.imavis.2022.104389