A coupled HMM approach to video-realistic speech animation

Authors:

Highlights:

Abstract

We propose a coupled hidden Markov model (CHMM) approach to video-realistic speech animation, which produces realistic facial animations driven by speaker-independent continuous speech. Unlike hidden Markov model (HMM)-based animation approaches that use a single chain of states, we use CHMMs to explicitly model the subtle characteristics of audio–visual speech, e.g., the asynchrony, temporal dependency (synchrony), and different speech classes between the two modalities. We derive an expectation–maximization (EM)-based A/V conversion algorithm for the CHMMs, which converts acoustic speech into facial animation parameters. We also present a video-realistic speech animation system. The system transforms the facial animation parameters into a mouth animation sequence, refines the animation with a performance refinement process, and finally stitches the animated mouth seamlessly onto a background facial sequence. We have compared the animation performance of the CHMMs with HMMs, multi-stream HMMs, and factorial HMMs, both objectively and subjectively. Results show that the CHMMs achieve superior animation performance. The ph-vi-CHMM system, which adopts different state variables (phoneme states and viseme states) in the audio and visual modalities, performs best. The results indicate that explicitly modeling audio–visual speech is a promising direction for speech animation.
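As background for readers unfamiliar with coupled HMMs, a common two-chain CHMM formulation (assumed here for illustration; the paper's exact parameterization may differ) factorizes the joint state transition as

p(q_t^a, q_t^v \mid q_{t-1}^a, q_{t-1}^v) = p(q_t^a \mid q_{t-1}^a, q_{t-1}^v) \, p(q_t^v \mid q_{t-1}^a, q_{t-1}^v),

where q_t^a and q_t^v denote the audio and visual state variables at time t. Because each chain's transition is conditioned on the previous states of both chains, the model can represent the asynchrony and temporal dependency between the two modalities that the abstract describes, which a single-chain HMM cannot capture explicitly.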

Keywords: Speech animation, Audio-to-visual conversion, Talking faces, Facial animation, Coupled hidden Markov models (CHMMs)

Article history: Received 13 March 2006, Revised 4 October 2006, Accepted 1 December 2006, Available online 18 January 2007.

DOI: https://doi.org/10.1016/j.patcog.2006.12.001