FrameDiffuser — G-Buffer-Conditioned Diffusion for Neural Forward Frame Rendering.
We propose FrameDiffuser, an autoregressive neural rendering framework that generates temporally consistent, photorealistic frames from G-buffer data. While single-image models like RGB↔X lack temporal consistency and video models like DiffusionRenderer require complete sequences upfront, our approach enables frame-by-frame generation for interactive applications where future frames depend on user input. After an initial frame, FrameDiffuser operates purely on incoming G-buffer data — geometry, materials, and surface properties — while using its previously generated frame for temporal guidance, maintaining stable generation over hundreds to thousands of frames with realistic lighting, shadows, and reflections.
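The frame-by-frame loop can be sketched as follows. The `model.generate` call and its argument names are hypothetical placeholders for the actual interface, shown only to make the autoregressive dependency explicit.

```python
import torch

@torch.no_grad()
def render_sequence(model, gbuffers, first_frame):
    """Autoregressively render a sequence from per-frame G-buffer data.

    gbuffers    : list of dicts holding 'basecolor', 'normal', 'depth',
                  'roughness', and 'metallic' tensors for each frame
    first_frame : RGB frame used to bootstrap generation
    """
    prev_frame = first_frame
    frames = [first_frame]
    for gbuf in gbuffers[1:]:
        # Each step sees only the incoming G-buffer and the previously
        # generated frame, so future frames can react to user input.
        # `model.generate` is an illustrative stand-in, not the released API.
        frame = model.generate(gbuffer=gbuf, previous_frame=prev_frame)
        frames.append(frame)
        prev_frame = frame  # self-conditioning for the next step
    return frames
```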
Our dual-conditioning architecture combines ControlNet for structural guidance from the G-buffer with ControlLoRA for temporal coherence from the previous frame. ControlNet processes a 10-channel input comprising basecolor, normals, depth, roughness, metallic, and an irradiance channel derived from the previous output. ControlLoRA conditions on the previous frame encoded in VAE latent space. Our three-stage training strategy — starting with black irradiance, then introducing temporal conditioning, then self-conditioning — enables stable autoregressive generation without error accumulation.
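A minimal sketch of how the 10-channel conditioning tensor and the ControlLoRA latent input could be assembled, assuming the channel layout above (3 basecolor, 3 normal, 1 depth, 1 roughness, 1 metallic, 1 irradiance) and a diffusers-style VAE; tensor and function names are illustrative, not the released code.

```python
import torch

def build_controlnet_input(gbuf, irradiance):
    """Stack G-buffer maps into the 10-channel ControlNet conditioning tensor.

    Assumed (B, C, H, W) shapes: basecolor (3), normal (3), depth (1),
    roughness (1), metallic (1), irradiance (1). The irradiance channel is
    derived from the previous output and is all zeros (black) in stage one
    of training.
    """
    cond = torch.cat(
        [
            gbuf["basecolor"],   # 3 channels
            gbuf["normal"],      # 3 channels
            gbuf["depth"],       # 1 channel
            gbuf["roughness"],   # 1 channel
            gbuf["metallic"],    # 1 channel
            irradiance,          # 1 channel
        ],
        dim=1,
    )
    assert cond.shape[1] == 10
    return cond

def encode_previous_frame(vae, previous_frame):
    """Encode the previous generated frame into VAE latent space for ControlLoRA.

    Assumes a diffusers-style AutoencoderKL; the scaling factor follows the
    usual latent-diffusion convention.
    """
    latents = vae.encode(previous_frame).latent_dist.sample()
    return latents * vae.config.scaling_factor
```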

FrameDiffuser transforms G-buffer data into photorealistic frames with accurate lighting, shadows, and reflections. We train environment-specific models for six different Unreal Engine 5 environments, demonstrating that per-environment specialization yields superior consistency within each domain. Our method achieves high visual quality while maintaining temporal consistency across extended sequences.
Environments shown: Hillside Sample Project, Downtown West, Electric Dreams, City Sample.
Comparison — Compared to X→RGB from RGB↔X, our method produces more realistic, high-detail illumination while maintaining temporal consistency over long sequences. X→RGB outputs appear artificially flat and uniformly lit, lacking the lighting variation, shadow depth, and atmospheric effects of photorealistic rendering.

Scene Editing — When objects are added to the scene through G-buffer modifications, FrameDiffuser automatically synthesizes appropriate lighting, shading, and cast shadows. Artists retain full control over scene composition while FrameDiffuser handles the computationally expensive lighting synthesis.
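As an illustration of this workflow, the sketch below composites an inserted object's G-buffer maps into the scene buffers under a coverage mask before the usual generation step; all names here are assumptions for illustration.

```python
import torch

def insert_object_into_gbuffer(scene_gbuf, object_gbuf, mask):
    """Composite an inserted object's G-buffer maps into the scene G-buffer.

    scene_gbuf, object_gbuf : dicts of (B, C, H, W) tensors
                              ('basecolor', 'normal', 'depth', ...)
    mask                    : (B, 1, H, W) coverage mask of the inserted object
    """
    edited = {}
    for key, scene_map in scene_gbuf.items():
        # Overwrite covered pixels with the object's values; the subsequent
        # generation step then synthesizes lighting, shading, and cast shadows
        # consistent with the edited scene.
        edited[key] = torch.where(mask.bool(), object_gbuf[key], scene_map)
    return edited
```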
