WonderPlay: Dynamic 3D Scene Generation
from a Single Image and Actions

Zizhang Li*1     Hong-Xing "Koven" Yu*1     Wei Liu1    
Yin Yang2     Charles Herrmann1     Gordon Wetzstein1     Jiajun Wu1    
1Stanford University    2University of Utah
*Contributed Equally

What Actions Can We Apply at This Tea Party?

Click on each action icon to see the generated dynamic scene and the input action.

Full Scene
Dripping honey on the cake
Letting the wind blow away the hat
Adding a breeze
Pushing down the wine glass

Action-Conditioned Dynamic 3D Scenes

WonderPlay generates dynamic 3D scenes from a single image and input actions, predicting the physical consequences of those actions. Here, we present video results rendered with a moving camera, overlaid with action visualizations.

Different Actions

WonderPlay synthesizes different dynamic scenes from the same image by altering the input actions and predicting the corresponding dynamic outcomes. Click on an image and an action to view the dynamic outcome.

Comparisons with Physics- and Video-Based Methods

WonderPlay supports dynamic 3D scene generation across a wide range of scenes and materials. Here, we show side-by-side comparisons with physics-based and video-generation-based baselines. Click on the image and baseline buttons to view the comparisons.

Interactive Viewer of Generated Dynamic 3D Scene

View the generated dynamic scene in the interactive viewer below. We extend the generated scene beyond the input image using the WonderWorld approach to create a mini world.
Keyboard: Move by "W/A/S/D", look around by "I/J/K/L".
Touch Screen: Move by one-finger drag, look around by two-finger drag.

Click to Load the Viewer


Abstract

WonderPlay is a novel framework that integrates physics simulation with video generation to produce action-conditioned dynamic 3D scenes from a single image. While prior works are restricted to rigid-body or simple elastic dynamics, WonderPlay features a hybrid generative simulator that synthesizes a wide range of 3D dynamics. The hybrid generative simulator first uses a physics solver to simulate coarse 3D dynamics, which then condition a video generator to produce a video with finer, more realistic motion. The generated video is in turn used to update the simulated dynamic 3D scene, closing the loop between the physics solver and the video generator. This approach combines intuitive user control with the accurate dynamics of physics-based simulators and the expressivity of diffusion-based video generators. Experimental results demonstrate that WonderPlay enables users to interact with scenes of diverse content, including cloth, sand, snow, liquid, smoke, and elastic and rigid bodies, all from a single image input. Code will be made public.
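To make the loop concrete, here is a minimal Python sketch of the hybrid generative simulation described above. The function and type names (simulate, generate, update, Scene, Action, Video) are hypothetical placeholders, not the WonderPlay API; the solver, generator, and update steps are passed in as callables so only the control flow is pinned down.

from typing import Any, Callable

Scene = Any   # reconstructed 3D scene representation (placeholder)
Action = Any  # user action, e.g., a wind field or a point force (placeholder)
Video = Any   # generated RGB video (placeholder)

def hybrid_generative_simulation(
    scene: Scene,
    action: Action,
    simulate: Callable[[Scene, Action], Any],  # physics solver -> coarse 3D dynamics
    generate: Callable[[Scene, Any], Video],   # video generator conditioned on the dynamics
    update: Callable[[Scene, Video], Scene],   # refine the 3D scene from the video
) -> Scene:
    # 1) The physics solver simulates coarse 3D dynamics under the input action.
    coarse_dynamics = simulate(scene, action)
    # 2) The video generator synthesizes finer, more realistic motion,
    #    conditioned on the coarse simulation.
    video = generate(scene, coarse_dynamics)
    # 3) The generated video updates the dynamic 3D scene, closing the loop
    #    between the physics solver and the video generator.
    return update(scene, video)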

Approach

WonderPlay Method Overview
Given a single image, we first reconstruct the 3D scene and estimate material properties. Then our hybrid generative simulator uses a physics solver together with the input actions to infer coarse 3D dynamics. The simulated appearance and motion signals condition the video generator through spatially varying bimodal control, which synthesizes realistic motion. Finally, the dynamic 3D scene is refined using the synthesized video, completing the hybrid generative simulation loop.
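As one illustration of spatially varying bimodal control, the sketch below blends per-pixel appearance and motion conditioning signals with a spatial weight map before they reach the video generator. The weighting scheme, names, and shapes are assumptions made for exposition, not the paper's exact formulation.

import numpy as np

def bimodal_control(appearance: np.ndarray,  # (H, W, C) rendered appearance features
                    motion: np.ndarray,      # (H, W, C) motion features, e.g., a flow embedding
                    confidence: np.ndarray   # (H, W) per-pixel simulation confidence in [0, 1]
                    ) -> np.ndarray:
    """Where the coarse simulation is trusted, lean on its motion signal;
    elsewhere, fall back to appearance guidance (hypothetical weighting)."""
    w = confidence[..., None]                # broadcast weight map to (H, W, 1)
    return w * motion + (1.0 - w) * appearance

# Tiny usage example with random inputs.
H, W, C = 4, 4, 8
ctrl = bimodal_control(np.random.rand(H, W, C),
                       np.random.rand(H, W, C),
                       np.random.rand(H, W))
print(ctrl.shape)  # (4, 4, 8)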

BibTex

@article{li2025wonderplay,
    title   = {WonderPlay: Dynamic 3D Scene Generation from a Single Image and Actions},
    author  = {Li, Zizhang and Yu, Hong-Xing and Liu, Wei and Yang, Yin and Herrmann, Charles and Wetzstein, Gordon and Wu, Jiajun},
    journal = {arXiv preprint arXiv:2505.18151},
    year    = {2025},
}