Wrist-view observations are crucial for VLA models: they capture the fine-grained hand–object interactions that directly improve manipulation performance. Yet large-scale datasets rarely include such recordings, leaving a substantial gap between abundant anchor views and scarce wrist views. Existing world models cannot bridge this gap, as they require a wrist-view first frame and thus cannot generate wrist-view videos from anchor views alone. Recent visual geometry models such as VGGT, however, provide exactly the geometric and cross-view priors needed to handle such extreme viewpoint shifts. Building on this insight, we propose WristWorld, the first 4D world model that generates wrist-view videos solely from anchor views. WristWorld operates in two stages: (i) Reconstruction, which extends VGGT and incorporates our Spatial Projection Consistency (SPC) loss to estimate geometrically consistent wrist-view poses and 4D point clouds; (ii) Generation, which employs our video generation model to synthesize temporally coherent wrist-view videos from the reconstructed perspective. Experiments on Droid, Calvin, and Franka Panda demonstrate state-of-the-art video generation with superior spatial consistency, while also improving VLA performance, raising the average task completion length on Calvin by 3.81% and closing 42.4% of the anchor–wrist view gap.
We introduce a two-stage 4D generative world model. In the reconstruction stage, VGGT is extended with a wrist head that regresses the wrist pose, guided by a Spatial Projection Consistency (SPC) loss that provides supervision directly from RGB, without depth or extrinsic annotations. The predicted pose is then used to project the reconstructed point clouds into the wrist view. In the generation stage, these projections, combined with external-view CLIP embeddings, condition a video generator that synthesizes wrist-view sequences. Because no wrist-view first frame is required, the model can produce additional wrist views for existing VLA datasets, yielding substantial performance gains.
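The projection step shared by both stages can be pictured as a standard pinhole re-rendering of the reconstructed scene. The sketch below is a minimal illustration under assumed names and shapes, not our released code: a hypothetical `project_to_wrist_view` helper takes the wrist pose and intrinsics estimated in the reconstruction stage and splats the world-frame point cloud into the wrist camera, producing the kind of sparse wrist-view projection that conditions the generator.

```python
# Minimal sketch (not the released code): project a reconstructed, world-frame
# point cloud into the predicted wrist view with a pinhole camera model.
# The wrist pose T_wc and intrinsics K are assumed to come from the extended
# VGGT wrist head; all names and shapes here are illustrative.
import numpy as np

def project_to_wrist_view(points_w, colors, T_wc, K, hw=(256, 256)):
    """Splat world-frame points into the wrist camera and return a sparse RGB image.

    points_w: (N, 3) world-frame point cloud
    colors:   (N, 3) per-point RGB in [0, 1]
    T_wc:     (4, 4) world-to-wrist-camera transform predicted for this frame
    K:        (3, 3) pinhole intrinsics of the wrist camera
    """
    H, W = hw
    # Move points into the wrist camera frame.
    pts_h = np.concatenate([points_w, np.ones((len(points_w), 1))], axis=1)
    pts_c = (T_wc @ pts_h.T).T[:, :3]
    front = pts_c[:, 2] > 1e-6          # keep only points in front of the camera
    pts_c, cols = pts_c[front], colors[front]

    # Perspective projection to pixel coordinates.
    uv = (K @ (pts_c / pts_c[:, 2:3]).T).T[:, :2]
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    ok = (u >= 0) & (u < W) & (v >= 0) & (v < H)

    # Z-buffered splatting: nearer points overwrite farther ones.
    image = np.zeros((H, W, 3), dtype=np.float32)
    depth = np.full((H, W), np.inf, dtype=np.float32)
    for ui, vi, zi, ci in zip(u[ok], v[ok], pts_c[ok, 2], cols[ok]):
        if zi < depth[vi, ui]:
            depth[vi, ui] = zi
            image[vi, ui] = ci
    return image  # sparse wrist-view projection used as conditioning
```

In the full pipeline such projections would be produced per frame and combined with the external-view CLIP embeddings before conditioning the video generator.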
We evaluate WristWorld on Droid, Calvin, and a real Franka Panda setup. The system is trained in two stages: Reconstruction extends VGGT with a wrist head and the Spatial Projection Consistency (SPC) loss to recover wrist poses and 4D point clouds directly from RGB, while Generation conditions a video generator on wrist-view projections and external-view features to synthesize temporally coherent wrist-view videos without a wrist first frame. Pretraining on a large multi-view corpus followed by cross-view fine-tuning enables strong generalization. Quantitatively, WristWorld reduces FVD and improves LPIPS, SSIM, and PSNR; used as a plug-and-play add-on to single-view world models, it further cuts FVD substantially. For downstream policies, the generated wrist views boost VLA performance, increasing the average task completion length by 3.81% and closing 42.4% of the anchor–wrist gap. The figures below summarize video metrics, VLA gains, and ablations.
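As a rough guide to how the per-frame quality numbers are computed, the snippet below is an illustrative evaluation loop rather than our benchmark code: it reports PSNR and SSIM with `skimage` on placeholder clips, while FVD and LPIPS additionally require pretrained networks (an I3D backbone and the `lpips` package) that are omitted here.

```python
# Illustrative evaluation sketch, not our benchmark code: per-frame PSNR/SSIM
# on placeholder clips. FVD and LPIPS need pretrained networks and are omitted.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_metrics(gt_video, gen_video):
    """Average PSNR/SSIM over paired (T, H, W, 3) uint8 ground-truth and generated clips."""
    psnrs, ssims = [], []
    for gt, gen in zip(gt_video, gen_video):
        psnrs.append(peak_signal_noise_ratio(gt, gen, data_range=255))
        ssims.append(structural_similarity(gt, gen, channel_axis=-1, data_range=255))
    return float(np.mean(psnrs)), float(np.mean(ssims))

# Random placeholder clips of 16 frames at 256x256 stand in for real wrist-view videos.
gt = np.random.randint(0, 256, (16, 256, 256, 3), dtype=np.uint8)
gen = np.random.randint(0, 256, (16, 256, 256, 3), dtype=np.uint8)
print(frame_metrics(gt, gen))
```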
We showcase qualitative results across simulated and real settings. Use the selector below to view side-by-side rollouts where the left panel is the ground-truth wrist view and the right panel is our generated wrist-view video conditioned only on anchor views.
Left: ground truth wrist view. Right: WristWorld generation.
w/o Ours — Close the upper drawer — 2× speed
Ours — Close the upper drawer — 2× speed
w/o Ours — Pick up the bread — 2× speed
Ours — Pick up the bread — 2× speed