Fusion of spectral and prosody modelling for multilingual speech emotion conversion

Authors:

Highlights:

Abstract

The paper proposes an integrated speech emotion conversion framework developed using speaker-independent, mixed-lingual training. The key contribution of the work is non-parallel training using i-vector probabilistic linear discriminant analysis (PLDA) modelling to estimate emotion-dependent latent vectors for three archetypal emotions (anger, fear, and happiness) across three datasets in different languages: EmoDB (German), IITKGP (Telugu), and SAVEE (English). The unified model integrates fundamental frequency (F0) and spectral modifications for neutral-to-emotional speech conversion. Wavelet synchrosqueezed transform (WSST) decomposition of F0, followed by training with a particle swarm optimized artificial neural network (PSO-ANN), provides improved performance, with an overall average mel cepstral distortion (MCD) of 4.72 dB and an F0 root mean square error (F0-RMSE) of 25.91 Hz. Subjective testing revealed an overall average mean opinion score (MOS) of 3.4, a comparative mean opinion score (CMOS) of 3.57, and a speaker similarity score of 3.72, each on a scale of 1–5. A detailed comparison with state-of-the-art methods for emotion conversion in English is also performed. The evaluations revealed that the proposed framework gave perceptually relevant expressive enrichment in neutral speech with optimum training data.
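The abstract reports MCD and F0-RMSE without giving their formulas. As a minimal sketch, the two objective measures could be computed as below, assuming the commonly used definitions: MCD as (10 / ln 10) * sqrt(2 * sum of squared cepstral differences) averaged over time-aligned frames, and F0-RMSE over frames that are voiced in both signals. Excluding the 0th (energy) coefficient, marking unvoiced frames with F0 = 0, and the function names themselves are assumptions made for illustration, not details taken from the paper.

```python
import math


def mel_cepstral_distortion(ref_frames, conv_frames):
    """Mean MCD in dB over time-aligned frames of mel-cepstral coefficients.

    Each frame is a sequence of coefficients; index 0 (energy) is skipped,
    which is a common convention assumed here, not one stated in the paper.
    """
    if not ref_frames or len(ref_frames) != len(conv_frames):
        raise ValueError("inputs must be non-empty and time-aligned")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref, conv in zip(ref_frames, conv_frames):
        squared = sum((r - c) ** 2 for r, c in zip(ref[1:], conv[1:]))
        total += scale * math.sqrt(2.0 * squared)
    return total / len(ref_frames)


def f0_rmse(ref_f0, conv_f0):
    """RMSE in Hz over frames voiced in both contours (F0 > 0 assumed voiced)."""
    if len(ref_f0) != len(conv_f0):
        raise ValueError("contours must be time-aligned")
    pairs = [(r, c) for r, c in zip(ref_f0, conv_f0) if r > 0 and c > 0]
    if not pairs:
        raise ValueError("no frames are voiced in both contours")
    return math.sqrt(sum((r - c) ** 2 for r, c in pairs) / len(pairs))
```

Both functions expect the reference and converted utterances to be time-aligned beforehand (for example by dynamic time warping), a step this sketch leaves out.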

Keywords: Emotion, i-vector, PLDA, WSST, ANN, PSO, MOS, CMOS

Article history: Received 27 May 2021, Revised 30 January 2022, Accepted 31 January 2022, Available online 9 February 2022, Version of Record 21 February 2022.

Official paper URL: https://doi.org/10.1016/j.knosys.2022.108360