WristWorld: Generating Wrist-Views via
4D World Models for Robotic Manipulation



Abstract

Wrist-view observations are crucial for VLA models as they capture fine-grained hand–object interactions that directly enhance manipulation performance. Yet large-scale datasets rarely include such recordings, resulting in a substantial gap between abundant anchor views and scarce wrist views. Existing world models cannot bridge this gap, as they require a wrist-view first frame and thus fail to generate wrist-view videos from anchor views alone. Recent visual geometry models such as VGGT, however, offer precisely the geometric and cross-view priors needed to handle such extreme viewpoint shifts. Building on these insights, we propose WristWorld, the first 4D world model that generates wrist-view videos solely from anchor views. WristWorld operates in two stages: (i) Reconstruction, which extends VGGT and incorporates our proposed Spatial Projection Consistency (SPC) Loss to estimate geometrically consistent wrist-view poses and 4D point clouds; (ii) Generation, which employs our designed video generation model to synthesize temporally coherent wrist-view videos from the reconstructed perspective. Experiments on Droid, Calvin, and Franka Panda demonstrate state-of-the-art video generation with superior spatial consistency, while also improving VLA performance, raising the average task completion length on Calvin by 3.81% and closing 42.4% of the anchor–wrist view gap.

Model Architecture

We introduce a two-stage 4D generative world model. In the reconstruction stage, VGGT is extended with a wrist head that regresses the wrist-camera pose, guided by a Spatial Projection Consistency (SPC) loss that provides supervision directly from RGB, without depth or extrinsics. The predicted pose is then used to project the reconstructed point clouds into the wrist view. In the generation stage, these projections, combined with external-view CLIP embeddings, condition a video generator that synthesizes wrist-view sequences. Because no wrist-view first frame is required, the model can produce additional wrist views for existing VLA datasets, yielding substantial performance gains.
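To make the reconstruction stage concrete, the sketch below projects a reconstructed 3D point cloud into the predicted wrist camera, the geometric operation that the SPC supervision from RGB builds on. It is a minimal PyTorch sketch under simple pinhole-camera assumptions; the function name, tensor shapes, and arguments are ours for illustration and are not the paper's implementation.

```python
# Minimal sketch (PyTorch) of projecting reconstructed points into the
# predicted wrist view. Names and shapes are illustrative assumptions.
import torch

def project_to_wrist(points_world, wrist_pose, K, image_size):
    """points_world: (N, 3) world-frame points from the reconstruction stage;
    wrist_pose: (4, 4) predicted world-to-wrist extrinsics;
    K: (3, 3) wrist-camera intrinsics; image_size: (H, W)."""
    H, W = image_size
    # Homogeneous transform into the wrist camera frame.
    ones = torch.ones(points_world.shape[0], 1, device=points_world.device)
    pts_h = torch.cat([points_world, ones], dim=1)            # (N, 4)
    pts_cam = (wrist_pose @ pts_h.T).T[:, :3]                 # (N, 3)
    # Keep only points in front of the camera.
    valid = pts_cam[:, 2] > 1e-6
    pts_cam = pts_cam[valid]
    # Pinhole projection to pixel coordinates.
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    # Discard projections that fall outside the wrist image.
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    return uv[inside], pts_cam[inside]
```

One natural way to use such projections, consistent with the RGB-only supervision described above, is to compare the projected points against wrist-view frames where they exist during training, so that no depth or calibrated extrinsics are needed.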

Overview of the two-stage WristWorld architecture.

Experiments

We evaluate WristWorld on Droid, Calvin, and a real Franka Panda setup. The system trains in two stages: Reconstruction extends VGGT with a wrist head and a Spatial Projection Consistency loss to recover wrist poses and 4D point clouds directly from RGB, while Generation conditions a video generator on wrist-view projections and external-view features to synthesize temporally coherent wrist-view videos without a wrist-view first frame. Pretraining on a large multi-view corpus followed by cross-view fine-tuning enables strong generalization. Quantitatively, WristWorld reduces FVD and improves LPIPS, SSIM, and PSNR; used as a plug-and-play add-on to single-view world models, it further cuts FVD substantially. For downstream policies, the generated wrist views boost VLA performance, increasing the average task completion length on Calvin by 3.81% and closing 42.4% of the anchor–wrist view gap. The figures below summarize video metrics, VLA gains, and ablations.
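For reference, the sketch below shows how the per-frame metrics reported above (PSNR, SSIM, LPIPS) can be computed with standard packages; FVD requires a pretrained video feature extractor and is omitted. The array shapes, value ranges, and function names are assumptions for illustration, not the evaluation code used in the paper.

```python
# Minimal per-frame metric sketch for generated vs. ground-truth wrist videos.
import numpy as np
import torch
import lpips  # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")  # AlexNet-based perceptual distance

def frame_metrics(gt_frames, gen_frames):
    """gt_frames, gen_frames: (T, H, W, 3) float arrays in [0, 1]."""
    psnr, ssim, lp = [], [], []
    for gt, gen in zip(gt_frames, gen_frames):
        psnr.append(peak_signal_noise_ratio(gt, gen, data_range=1.0))
        ssim.append(structural_similarity(gt, gen, channel_axis=-1, data_range=1.0))
        # LPIPS expects (N, 3, H, W) tensors scaled to [-1, 1].
        to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() * 2 - 1
        with torch.no_grad():
            lp.append(lpips_fn(to_t(gt), to_t(gen)).item())
    return {"PSNR": np.mean(psnr), "SSIM": np.mean(ssim), "LPIPS": np.mean(lp)}
```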

Quantitative Results

Quantitative video generation metrics (FVD table)

VLA & World Model Results

Quantitative results and VLA improvements

Ablation Studies

Ablation results

Visualization

We showcase qualitative results across simulated and real-world settings. Use the selector below to view side-by-side rollouts: the left panel shows the ground-truth wrist view, and the right panel shows our generated wrist-view video conditioned only on anchor views.

Left: ground truth wrist view. Right: WristWorld generation.

Selected visualization

Real-World Baseline Comparisons

Real-world baseline comparisons

Calvin VLA Results

w/o Ours

w/o Ours

Ours

Ours

Real-World (Franka) VLA Results

w/o Ours — Close the upper drawer — 2× speed

Ours — Close the upper drawer — 2× speed

w/o Ours — Pick up the bread — 2× speed

Ours — Pick up the bread — 2× speed