2024 IEEE International Conference on Artificial Intelligence and eXtended and Virtual Reality (AIxVR)

Abstract

Social XR applications usually require advanced tracking equipment to control one's own avatar. We explore whether AI-based co-speech gesture generation techniques can compensate for the tracking hardware that many users lack. One main challenge is achieving convincing behavior quality without introducing too much latency. Previous work has shown that both depend, in opposite ways, on the length of the audio chunk from which gestures are generated: the gesture quality of existing models declines at smaller chunk sizes, yet latency still does not become low enough to enable fluent interaction. In this paper we present an approach that generates continuous gesture trajectories frame by frame, minimizing latency and yielding delays well below the buffer sizes of voice communication systems or video calls. A project page with videos of the generated gestures is available at https://nkrome.github.io/FrameCAGE.html.
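To make the latency argument concrete, below is a minimal sketch of frame-by-frame streaming gesture generation; it is not the paper's actual model, and all names, dimensions, and the recurrent architecture are illustrative assumptions. The point it demonstrates is the abstract's core idea: when each incoming audio frame immediately yields one pose frame, the generation delay is bounded by a single frame period rather than by a multi-second audio chunk.

```python
# Sketch only: a hypothetical autoregressive generator that maps one
# audio feature frame to one pose frame per step. Dimensions and model
# choice (a GRU cell) are assumptions, not the paper's architecture.
import torch
import torch.nn as nn

AUDIO_DIM = 64   # assumed per-frame audio feature size (e.g. mel bands)
POSE_DIM = 57    # assumed pose size (e.g. 19 joints x 3 rotations)
HIDDEN = 256

class StreamingGestureModel(nn.Module):
    """One audio frame in, one pose frame out, with recurrent state."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRUCell(AUDIO_DIM + POSE_DIM, HIDDEN)
        self.head = nn.Linear(HIDDEN, POSE_DIM)

    def step(self, audio_frame, prev_pose, hidden):
        # Condition on the current audio frame and the previously emitted
        # pose so consecutive frames form a continuous trajectory.
        x = torch.cat([audio_frame, prev_pose], dim=-1)
        hidden = self.rnn(x, hidden)
        pose = self.head(hidden)
        return pose, hidden

def stream_gestures(model, audio_frames):
    """Consume audio features frame by frame and yield poses immediately."""
    hidden = torch.zeros(1, HIDDEN)
    pose = torch.zeros(1, POSE_DIM)   # rest pose as the initial condition
    with torch.no_grad():
        for frame in audio_frames:    # frame: shape (1, AUDIO_DIM)
            pose, hidden = model.step(frame, pose, hidden)
            yield pose                # available after one frame of delay

# Example: drive an avatar from a live audio feature stream.
model = StreamingGestureModel()
live_audio = (torch.randn(1, AUDIO_DIM) for _ in range(90))  # stand-in
for pose in stream_gestures(model, live_audio):
    pass  # apply `pose` to the avatar rig each frame
```

In this streaming pattern, the worst-case added delay is one frame (e.g. about 33 ms at 30 fps), which is consistent with the abstract's claim of delays below the buffer sizes typical of voice or video communication systems.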
