Accurately estimating and forecasting human body pose is important for enhancing the user's sense of immersion in Augmented Reality. Addressing this need, our paper introduces EgoCast, a bimodal method for 3D human pose forecasting that uses egocentric videos and proprioceptive data. We study human pose forecasting in a realistic setting, extending the temporal boundaries of forecasting in dynamic scenes and building on current frameworks for pose estimation in the wild. We introduce a current-frame estimation module that generates pseudo-groundtruth poses at inference time, eliminating the need for the past groundtruth poses that existing forecasting methods typically require. Our experimental results on the recent Ego-Exo4D and Aria Digital Twin datasets validate EgoCast for real-life motion estimation. On the Ego-Exo4D Body Pose 2024 Challenge, our method significantly outperforms state-of-the-art approaches, laying the groundwork for future research on human pose estimation and forecasting in unscripted activities with egocentric inputs.
The task in the Human Pose Forecasting Benchmark is to predict a set of future 3D human poses given visual and proprioceptive data from the past. EgoCast targets a realistic forecasting setting: it does not assume that models have access to ground-truth poses from previous frames, and we evaluate on longer-than-usual time horizons.
Given a sequence of camera poses and RGB images from the headset, we first estimate the 3D full-body pose at each timestamp with the current-frame estimation module. The pose forecasting module then uses these predicted body poses, together with the proprioceptive inputs, to estimate the 3D human poses in subsequent frames. Overall, EgoCast provides a complete pipeline for 3D human pose estimation in real-world scenarios where only the input streams from the headset are available.
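The two-stage pipeline described above can be sketched as follows. This is a minimal illustrative skeleton, not the released EgoCast code: the function names, joint count, and both stage implementations are placeholders (the current-frame estimator simply tiles the camera translation, and the forecaster uses constant-velocity extrapolation in place of the learned modules).

```python
from dataclasses import dataclass
from typing import List

NUM_JOINTS = 17  # assumed joint count; the actual skeleton may differ


@dataclass
class Frame:
    """Per-timestamp headset inputs."""
    camera_pose: List[float]   # 6-DoF headset pose (proprioceptive input)
    rgb_features: List[float]  # placeholder for visual features


def estimate_current_pose(frame: Frame) -> List[List[float]]:
    """Stage 1 stand-in: map headset inputs to a 3D full-body pose.

    Returns one (x, y, z) triple per joint. The real module is a learned
    network; here we just tile the camera translation for illustration.
    """
    x, y, z = frame.camera_pose[:3]
    return [[x, y, z] for _ in range(NUM_JOINTS)]


def forecast_poses(past_poses: List[List[List[float]]],
                   horizon: int) -> List[List[List[float]]]:
    """Stage 2 stand-in: predict `horizon` future poses.

    Constant-velocity extrapolation as a placeholder for the learned
    forecasting module, which also consumes proprioceptive inputs.
    """
    last, prev = past_poses[-1], past_poses[-2]
    vel = [[l - p for l, p in zip(lj, pj)] for lj, pj in zip(last, prev)]
    future, cur = [], last
    for _ in range(horizon):
        cur = [[c + v for c, v in zip(cj, vj)] for cj, vj in zip(cur, vel)]
        future.append(cur)
    return future


def egocast_pipeline(frames: List[Frame], horizon: int):
    # Stage 1: pseudo-groundtruth poses for every observed frame, so no
    # past ground-truth poses are needed at inference time.
    estimated = [estimate_current_pose(f) for f in frames]
    # Stage 2: forecast future poses from the estimated ones.
    return forecast_poses(estimated, horizon)
```

The key design point the sketch captures is that stage 2 never sees ground-truth poses: it conditions only on stage 1's estimates, matching the realistic inference setting described above.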
@inproceedings{escobar2025egocast,
author = {Escobar, Maria and Puentes, Juanita and Forigua, Cristhian and Pont-Tuset, Jordi and Maninis, Kevis-Kokitsi and Arbeláez, Pablo},
title = {EgoCast: Forecasting Egocentric Human Pose in the Wild},
booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
year = {2025},
}