E2E-V2SResNet: Deep residual convolutional neural networks for end-to-end video driven speech synthesis
Authors:
Highlights:
• An end-to-end ResNet (E2E-ResNet) model for synthesizing speech signals from silent video is proposed.
• Two techniques are proposed to synthesize a speech waveform from Mel-scale and linear-scale features.
• A fast Griffin-Lim algorithm for spectrogram inversion is proposed to synthesize intelligible speech of acceptable quality.
• A 3D-CNN-based waveform CRITIC is proposed to differentiate real and synthesized speech waveforms.
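The fast Griffin-Lim step named in the highlights recovers a time-domain waveform from a magnitude spectrogram by iteratively re-estimating phase, accelerated with a momentum term. The sketch below is a minimal, hypothetical illustration of that general technique using SciPy's STFT routines; the paper's exact window sizes, iteration counts, and momentum value are not given here, so the parameters are assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def fast_griffin_lim(mag, n_iter=32, momentum=0.99, nperseg=512, noverlap=384):
    """Reconstruct a waveform from a magnitude spectrogram with
    momentum-accelerated (fast) Griffin-Lim phase estimation.

    mag: magnitude spectrogram, shape (n_freq, n_frames).
    Parameters are illustrative defaults, not the paper's settings.
    """
    rng = np.random.default_rng(0)
    # Start from a random phase estimate.
    angles = np.exp(2j * np.pi * rng.random(mag.shape))
    t_prev = np.zeros_like(mag, dtype=complex)
    for _ in range(n_iter):
        # Project to the time domain and back (STFT consistency step).
        _, x = istft(mag * angles, nperseg=nperseg, noverlap=noverlap)
        _, _, t = stft(x, nperseg=nperseg, noverlap=noverlap)
        # Keep the frame count aligned with the target spectrogram.
        if t.shape[1] < mag.shape[1]:
            t = np.pad(t, ((0, 0), (0, mag.shape[1] - t.shape[1])))
        else:
            t = t[:, :mag.shape[1]]
        # Momentum step: the "fast" acceleration over plain Griffin-Lim.
        update = t + momentum * (t - t_prev)
        angles = update / np.maximum(np.abs(update), 1e-16)
        t_prev = t
    _, x = istft(mag * angles, nperseg=nperseg, noverlap=noverlap)
    return x
```

With `momentum=0`, this reduces to the classic Griffin-Lim iteration; the momentum term typically reaches a comparable reconstruction in far fewer iterations.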
Keywords: Video processing, E2E speech synthesis, ResNet-18, Residual CNN, Waveform CRITIC
Article history: Received 27 October 2021, Revised 5 January 2022, Accepted 16 January 2022, Available online 31 January 2022, Version of Record 10 February 2022.
Paper link: https://doi.org/10.1016/j.imavis.2022.104389