BoDiffusion: Diffusing Sparse Observations for Full-Body Human Motion Synthesis

¹CINFONIA, Universidad de Los Andes, ²University of Caen Normandie, ENSICAEN, CNRS, France, ³Meta AI
*Denotes equal contribution.

BoDiffusion synthesizes more accurate motions with substantially less jitter than AvatarPoser.

Abstract

Mixed reality applications require tracking the user's full-body motion to enable an immersive experience. However, typical head-mounted devices can only track head and hand movements, which limits the reconstruction of full-body motion because many lower-body configurations are consistent with the same head and hand signals.

We propose BoDiffusion, a generative diffusion model for motion synthesis, to tackle this under-constrained reconstruction problem. We present a time and space conditioning scheme that allows BoDiffusion to leverage sparse tracking inputs while generating smooth and realistic full-body motion sequences. To the best of our knowledge, this is the first approach that uses the reverse diffusion process to model full-body tracking as a conditional sequence generation task. We conduct experiments on the large-scale motion-capture dataset AMASS and show that our approach outperforms the state of the art by a significant margin in terms of full-body motion realism and joint reconstruction error.

BoDiffusion

Architecture

BoDiffusion is a diffusion process that synthesizes full-body motion conditioned on sparse tracking signals.

Denoising Steps

During inference, we start from random Gaussian noise and perform T denoising steps until we reach a clean output motion.
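The inference procedure described above can be illustrated with a minimal sketch of a standard DDPM-style ancestral sampling loop. This is not the authors' implementation: the linear noise schedule, the `denoiser(x, t, cond)` interface, the noise-prediction parameterization, and all array shapes are assumptions chosen for illustration.

```python
import numpy as np

def sample_motion(denoiser, cond, pose_dim, T=50, seed=0):
    """Run T reverse-diffusion steps, starting from Gaussian noise.

    denoiser: hypothetical callable (x, t, cond) -> predicted noise, same shape as x.
    cond:     sparse tracking signals, shape (frames, cond_dim) (assumed layout).
    Returns a motion array of shape (frames, pose_dim).
    """
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    # Start from pure Gaussian noise.
    x = rng.standard_normal((cond.shape[0], pose_dim))

    for t in range(T - 1, -1, -1):
        eps_hat = denoiser(x, t, cond)   # conditioned noise estimate
        # Posterior mean of x_{t-1} given x_t under the DDPM parameterization.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            # Add fresh noise at every step except the last one.
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
        else:
            x = mean
    return x
```

A trained network would take the place of `denoiser`; for a quick shape check, a placeholder such as `lambda x, t, c: np.zeros_like(x)` is enough to run the loop end to end.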

BoDiffusion synthesizes substantially more accurate and plausible full-body poses, particularly in the lower body where no IMU data are captured.

More Examples

Unconventional Poses

BoDiffusion predicts plausible poses even for uncommon movements like crouching or lying down.

Error on Individual Poses

BoDiffusion predicts poses with higher fidelity to the ground truth. In contrast, AvatarPoser struggles to predict accurate lower-body configurations.
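One common way to quantify how far a predicted pose is from the ground truth is the mean per-joint position error. The sketch below is an illustration, not the evaluation code used for these figures; it assumes joint positions are stored as arrays of shape `(frames, joints, 3)`, and the function name is hypothetical.

```python
import numpy as np

def per_pose_error(pred, gt):
    """Mean Euclidean joint distance for each frame.

    pred, gt: arrays of shape (frames, joints, 3) (assumed layout).
    Returns an array of shape (frames,), in the same units as the inputs.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if pred.shape != gt.shape:
        raise ValueError("pred and gt must have the same shape")
    # Distance per joint, then average over the joints of each pose.
    return np.linalg.norm(pred - gt, axis=-1).mean(axis=-1)
```

Averaging the returned vector over frames gives a single sequence-level error, while the per-frame values show which individual poses are reconstructed poorly.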

BibTeX

@inproceedings{castillo2023bodiffusion,
  author    = {Castillo, Angela and Escobar, Maria and Jeanneret, Guillaume and Pumarola, Albert and Arbeláez, Pablo and Thabet, Ali and Sanakoyeu, Artsiom},
  title     = {BoDiffusion: Diffusing Sparse Observations for Full-Body Human Motion Synthesis},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year      = {2023},
}