EgoCast: Forecasting Egocentric Human Pose in the Wild

1CINFONIA, Universidad de Los Andes, Bogotá

2Google, Zürich

EgoCast is a novel framework for full-body pose forecasting. We use visual and proprioceptive cues to accurately predict body motion.

Abstract

Accurately estimating and forecasting human body pose is important for enhancing the user's sense of immersion in Augmented Reality. Addressing this need, our paper introduces EgoCast, a bimodal method for 3D human pose forecasting using egocentric videos and proprioceptive data. We study the task of human pose forecasting in a realistic setting, extending the boundaries of temporal forecasting in dynamic scenes and building on a current-frame pose estimation framework that operates in the wild. We introduce a current-frame estimation module that generates pseudo-groundtruth poses for inference, eliminating the need for the past groundtruth poses that existing methods typically require during forecasting. Our experimental results on the recent Ego-Exo4D and Aria Digital Twin datasets validate EgoCast for real-life motion estimation. On the Ego-Exo4D Body Pose 2024 Challenge, our method significantly outperforms state-of-the-art approaches, laying the groundwork for future research in human pose estimation and forecasting in unscripted activities with egocentric inputs.

EgoCast

Human Pose Forecasting Benchmark

The task for the Human Pose Forecasting Benchmark is to predict a set of future 3D human poses given visual and proprioceptive data from the past. EgoCast targets a realistic forecasting setting: it does not assume that models have access to ground-truth poses from previous frames, and we evaluate on longer-than-usual time horizons.

Benchmark formulation.
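The task described above can be sketched formally as follows. The notation is our own shorthand, not the paper's: \(c\) denotes the headset camera poses (proprioception), \(I\) the egocentric RGB frames, \(T\) the length of the observed window, \(H\) the forecasting horizon, and \(\hat{P}\) the predicted full-body 3D poses.

```latex
% Sketch of the forecasting task under the assumed notation:
% observe camera poses and images over the past T frames,
% predict body poses for the next H frames.
\hat{P}_{t+1:t+H} \;=\; f_\theta\!\left(c_{t-T:t},\; I_{t-T:t}\right)
```

Note that, in this realistic setting, the ground-truth poses \(P_{t-T:t}\) are not inputs to \(f_\theta\); only the headset streams are available.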

Architecture

Given a sequence of camera poses and RGB images from the headset, we first estimate the 3D full-body pose at each timestamp via the current-frame estimation module. Then, the pose forecasting module uses these predicted body poses, together with the proprioceptive inputs, to forecast the 3D human poses in future frames. Overall, EgoCast proposes a complete pipeline for 3D human pose estimation in real-world scenarios where only the input streams provided by the headset are available.
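The two-stage pipeline above can be sketched in code. This is a minimal illustration with hypothetical names (`CurrentFrameEstimator`, `PoseForecaster`, `egocast_pipeline` are ours, and the models are stood in by trivial placeholders), not the actual implementation.

```python
from dataclasses import dataclass
from typing import List, Sequence

Pose3D = List[float]  # flattened (J x 3) joint coordinates


@dataclass
class HeadsetFrame:
    camera_pose: List[float]  # 6-DoF headset pose (proprioceptive input)
    rgb_image: object         # egocentric RGB frame (placeholder)


class CurrentFrameEstimator:
    """Stage 1: estimate the full-body 3D pose at each observed timestamp,
    producing pseudo-groundtruth poses for inference."""

    def estimate(self, frame: HeadsetFrame) -> Pose3D:
        # A real model would fuse the image and camera pose; here the
        # camera pose is echoed as a stand-in pseudo-groundtruth pose.
        return list(frame.camera_pose)


class PoseForecaster:
    """Stage 2: forecast future body poses from the stage-1 estimates
    together with the proprioceptive (camera-pose) stream."""

    def forecast(self, past_poses: Sequence[Pose3D],
                 camera_poses: Sequence[List[float]],
                 horizon: int) -> List[Pose3D]:
        # Trivial constant-pose baseline: repeat the last estimate.
        last = past_poses[-1]
        return [list(last) for _ in range(horizon)]


def egocast_pipeline(frames: Sequence[HeadsetFrame],
                     horizon: int) -> List[Pose3D]:
    """Run both stages: no ground-truth past poses are ever consumed."""
    estimator, forecaster = CurrentFrameEstimator(), PoseForecaster()
    past_poses = [estimator.estimate(f) for f in frames]   # pseudo-groundtruth
    camera_poses = [f.camera_pose for f in frames]
    return forecaster.forecast(past_poses, camera_poses, horizon)
```

The key structural point the sketch captures is that the forecaster consumes only stage-1 estimates and headset streams, never ground-truth poses.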


Results

Current-frame pose estimation


Pose forecasting


BibTeX

@inproceedings{escobar2025egocast,
  author    = {Escobar, Maria and Puentes, Juanita and Forigua, Cristhian and Pont-Tuset, Jordi and Maninis, Kevis-Kokitsi and Arbeláez, Pablo},
  title     = {EgoCast: Forecasting Egocentric Human Pose in the Wild},
  booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
  year      = {2025},
}