Social navigation requires robots to act safely in dynamic human environments. Effective behavior demands thinking ahead: reasoning about how the scene and pedestrians evolve under different robot actions rather than reacting to current observations alone. This creates a coupled prediction-planning challenge, where robot actions and human motion mutually influence each other.
To address this challenge, we propose NavThinker, a future-aware framework that couples an action-conditioned world model with on-policy reinforcement learning. The world model operates in the Depth Anything V2 patch feature space and performs autoregressive prediction of future scene geometry and human motion; multi-head decoders then produce future depth maps and human trajectories, yielding a future-aware state aligned with traversability and interaction risk.
Crucially, we train the policy with DD-PPO while injecting the world model's think-ahead signals through two channels: (i) action-conditioned future features fused into the current observation embedding, and (ii) social reward shaping from predicted human trajectories. Experiments on single- and multi-robot Social-HM3D show state-of-the-art navigation success, with zero-shot transfer to Social-MP3D and real-world deployment on a Unitree Go2, validating generalization and practical applicability.
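To make signal (ii) concrete, here is a minimal sketch of social reward shaping from predicted human trajectories. The function name, comfort radius, penalty weight, and linear penalty form are all illustrative assumptions, not the paper's exact formulation: the idea is simply to penalize planned robot positions that intrude on where the world model predicts pedestrians will be.

```python
import math

COMFORT_RADIUS = 1.0   # assumed personal-space radius (m), hypothetical value
PENALTY_SCALE = 0.5    # assumed shaping weight, hypothetical value

def social_shaping(robot_pos, predicted_trajs):
    """Return a non-positive shaping term from imagined human motion.

    robot_pos: (x, y) planned robot position at a future step.
    predicted_trajs: list of trajectories, each a list of (x, y) waypoints
                     as produced by a human-trajectory decoder.
    """
    penalty = 0.0
    for traj in predicted_trajs:
        for hx, hy in traj:
            d = math.hypot(robot_pos[0] - hx, robot_pos[1] - hy)
            if d < COMFORT_RADIUS:
                # Closer intrusions into personal space cost more.
                penalty -= PENALTY_SCALE * (COMFORT_RADIUS - d)
    return penalty
```

Because the penalty is computed on *predicted* rather than observed positions, the policy is discouraged from actions whose consequences only become socially costly a few steps later.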
Fig. 1. Overview of the NavThinker framework. Our framework consists of two modules: a world model that learns action-conditioned scene dynamics (top), and an imagination-augmented planner policy trained with DD-PPO (bottom).
During World Model Learning, depth observations, actions, and robot states are extracted from the Replay Buffer. A frozen DA-V2 ViT encoder extracts patch embeddings, and a Causal Attention Transformer autoregressively predicts future latent features. A Depth Decoder and a Human Trajectory Decoder are trained alongside a latent consistency loss to anchor representations to scene geometry and human motion.
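The autoregressive prediction above can be sketched as a simple rollout loop. The real model uses a frozen DA-V2 ViT encoder and a causal-attention transformer over patch embeddings; here a toy hand-written dynamics function stands in for that transformer and latents are plain 2-vectors, purely to show the structure in which each predicted latent is fed back as input for the next step.

```python
def predict_next(latent, action):
    """Stand-in for the transformer: next latent from (latent, action).

    Illustrative toy dynamics only; the actual model predicts DA-V2
    patch features with causal attention.
    """
    return (latent[0] + 0.1 * action[0], latent[1] + 0.1 * action[1])

def rollout(latent, actions):
    """Autoregressive rollout: each predicted latent conditions the next."""
    futures = []
    for a in actions:
        latent = predict_next(latent, a)  # feed prediction back in
        futures.append(latent)
    return futures
```

In the full system, each latent in `futures` would additionally pass through the depth and human-trajectory decoders, whose losses (together with the latent consistency loss) anchor the rollout to real scene geometry and motion.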
During Policy Learning, the Observation Encoder produces the current embedding from depth and robot states. The Imagination module queries the world model under each candidate action, generating action-conditioned future features that are fused with the current embedding via Feature Fusion. The fused representation feeds into the DD-PPO Actor-Critic network. Predicted human trajectories additionally provide Reward Shaping, coupling prediction with planning in an imagine-then-act loop.
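The imagine-then-act loop can be sketched as follows. The world model, fusion step (plain concatenation here), and scoring function are illustrative placeholders, not the paper's networks: in the actual framework the fused representation feeds a DD-PPO actor-critic rather than an explicit per-action search, but the sketch shows how candidate actions are each rolled forward before acting.

```python
def imagine(world_model, state, action, horizon=3):
    """Roll the world model forward under one candidate action."""
    futures = []
    for _ in range(horizon):
        state = world_model(state, action)  # action-conditioned prediction
        futures.append(state)
    return futures

def choose_action(world_model, obs_embedding, state, candidates, score):
    """Pick the action whose fused (current + imagined) features score best."""
    best_action, best_value = None, float("-inf")
    for action in candidates:
        futures = imagine(world_model, state, action)
        # Feature fusion as simple concatenation of current and future features.
        fused = obs_embedding + [f for step in futures for f in step]
        value = score(fused)
        if value > best_value:
            best_action, best_value = action, value
    return best_action
```

For example, with a toy world model that translates the state by the action and a score that rewards ending near a goal, `choose_action` selects the candidate whose imagined rollout lands closest to that goal.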
Table 1. Results on Social-HM3D and Social-MP3D (zero-shot transfer from Social-HM3D). SR / SPL: success & efficiency (↑); PSC: social compliance (↑); H-Coll: human collisions (↓). Bold = best, underline = 2nd-best.
| Methods | Social-HM3D SR↑ | SPL↑ | PSC↑ | H-Coll↓ | Social-MP3D SR↑ | SPL↑ | PSC↑ | H-Coll↓ |
|---|---|---|---|---|---|---|---|---|
| Rule-based | ||||||||
| A* | 44.81 | 43.99 | 90.38 | 54.80 | 45.67 | 44.69 | 91.97 | 54.00 |
| ORCA | 37.44 | 32.91 | 92.23 | 39.77 | 38.81 | 34.65 | 94.03 | 39.86 |
| Reinforcement Learning-based | ||||||||
| Habitat-official | 38.99 | 33.53 | 90.37 | 55.48 | 37.00 | 31.76 | 92.03 | 52.33 |
| Falcon | 56.26 | 52.05 | 89.76 | 41.22 | 51.67 | 45.54 | 92.53 | 40.67 |
| NavThinker | 59.46 | 55.00 | 89.91 | 39.09 | 47.33 | 41.71 | 93.68 | 37.67 |
Table 2. On Social-HM3D: multiple robots navigate to individual goals without communication. SR / SPL (↑), PSC (↑), H-Coll (↓), plus team-level T-SR / T-SPL (↑). Bold = best, underline = 2nd-best.
| Method | SR↑ | SPL↑ | PSC↑ | H-Coll↓ | T-SR↑ | T-SPL↑ |
|---|---|---|---|---|---|---|
| Rule-based | ||||||
| A* | 26.06 | 25.70 | 95.20 | 35.68 | 14.76 | 14.51 |
| ORCA | 24.13 | 22.48 | 95.53 | 35.13 | 15.09 | 13.84 |
| Reinforcement Learning-based | ||||||
| Habitat-official | 26.78 | 24.41 | 95.47 | 31.39 | 14.98 | 13.68 |
| Falcon | 28.63 | 26.48 | 94.98 | 28.63 | 16.08 | 15.12 |
| NavThinker | 30.04 | 28.14 | 95.55 | 25.55 | 16.30 | 15.22 |
If you find this work useful, please consider citing:
```bibtex
@article{hu2026navthinker,
  title   = {NavThinker: Action-Conditioned World Models for Coupled Prediction and Planning in Social Navigation},
  author  = {Hu, Tianshuai and Gong, Zeying and Kong, Lingdong and Mei, XiaoDong and Ding, Yiyi and Zeng, Qi and Liang, Ao and Li, Rong and Zhong, Yangyi and Liang, Junwei},
  journal = {arXiv preprint arXiv:2603.15359},
  year    = {2026},
  url     = {https://arxiv.org/abs/2603.15359}
}
```