2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Abstract

Co-speech gestures, if presented in the lively form of videos, can achieve superior visual effects in human-machine interaction. While previous works mostly generate structural human skeletons, resulting in the omission of appearance information, we focus on the direct generation of audio-driven co-speech gesture videos in this work. There are two main challenges: 1) A suitable motion feature is needed to describe complex human movements with crucial appearance information. 2) Gestures and speech exhibit inherent dependencies and should be temporally aligned even when they are of arbitrary length. To solve these problems, we present a novel motion-decoupled framework to generate co-speech gesture videos. Specifically, we first introduce a well-designed nonlinear TPS transformation to obtain latent motion features preserving essential appearance information. Then a transformer-based diffusion model is proposed to learn the temporal correlation between gestures and speech and to perform generation in the latent motion space, followed by an optimal motion selection module to produce long-term coherent and consistent gesture videos. For better visual perception, we further design a refinement network focusing on missing details of certain areas. Extensive experimental results show that our proposed framework significantly outperforms existing approaches in both motion and video-related evaluations. Our code, demos, and more resources are available at https://github.com/thuhcsi/S2G-MDDiffusion.
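
To make the described pipeline more concrete, below is a minimal, hypothetical PyTorch sketch of a transformer denoiser that predicts noise on a sequence of latent motion features conditioned on aligned audio features, in the spirit of the transformer-based diffusion model mentioned in the abstract. The module names, feature dimensions, and toy noise schedule are illustrative assumptions and are not taken from the authors' implementation; see the linked repository for the actual code.

```python
# Hypothetical sketch (not the authors' code): DDPM-style training step for a
# transformer denoiser over latent motion features conditioned on audio.
import torch
import torch.nn as nn


class MotionDenoiser(nn.Module):
    def __init__(self, motion_dim=50, audio_dim=128, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.motion_in = nn.Linear(motion_dim, d_model)   # embed noisy latent motion
        self.audio_in = nn.Linear(audio_dim, d_model)     # embed per-frame audio features
        self.time_embed = nn.Sequential(                  # embed diffusion timestep
            nn.Linear(1, d_model), nn.SiLU(), nn.Linear(d_model, d_model)
        )
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.motion_out = nn.Linear(d_model, motion_dim)  # predict added noise

    def forward(self, noisy_motion, audio, t):
        # noisy_motion: (B, T, motion_dim), audio: (B, T, audio_dim), t: (B,)
        h = self.motion_in(noisy_motion) + self.audio_in(audio)
        h = h + self.time_embed(t.float().unsqueeze(-1)).unsqueeze(1)  # broadcast over frames
        return self.motion_out(self.encoder(h))  # (B, T, motion_dim)


# One training step on random tensors (shapes only, for illustration).
B, T = 2, 64
model = MotionDenoiser()
x0 = torch.randn(B, T, 50)        # clean latent motion sequence (e.g. TPS parameters)
audio = torch.randn(B, T, 128)    # temporally aligned audio features
t = torch.randint(0, 1000, (B,))  # diffusion timesteps
alpha_bar = torch.cos(t.float() / 1000 * torch.pi / 2) ** 2  # toy cosine noise schedule
noise = torch.randn_like(x0)
xt = alpha_bar.sqrt().view(B, 1, 1) * x0 + (1 - alpha_bar).sqrt().view(B, 1, 1) * noise
loss = nn.functional.mse_loss(model(xt, audio, t), noise)
loss.backward()
```

At sampling time, such a denoiser would be applied iteratively to generate latent motion sequences from noise, which are then decoded back into video frames; the long-term coherence and refinement stages described in the abstract are outside the scope of this sketch.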