Pretrained Video Models as Differentiable Physics Simulators for Urban Wind Flows

Abstract

Designing urban spaces that provide pedestrian wind comfort and safety requires time-resolved Computational Fluid Dynamics (CFD) simulations, but their current computational cost makes extensive design exploration impractical. We introduce WinDiNet (Wind Diffusion Network), a pretrained video diffusion model that is repurposed as a fast, differentiable surrogate for this task. Starting from LTX-Video, a 2B-parameter latent video transformer, we fine-tune on 10,000 2D incompressible CFD simulations over procedurally generated building layouts. A systematic study of training regimes, conditioning mechanisms, and VAE adaptation strategies, including a physics-informed decoder loss, identifies a configuration that outperforms purpose-built neural PDE solvers. The resulting model generates full 112-frame rollouts in under a second. As the surrogate is end-to-end differentiable, it doubles as a physics simulator for gradient-based inverse optimization: given an urban footprint layout, we optimize building positions directly through backpropagation to improve wind safety as well as pedestrian wind comfort. Experiments on single- and multi-inlet layouts show that the optimizer discovers effective layouts even under challenging multi-objective configurations, with all improvements confirmed by ground-truth CFD simulations.

Pipeline

(a) Procedurally generated urban layouts are simulated with a 2D incompressible Euler solver to produce training data. (b) A latent diffusion model with a physics-informed VAE is trained to generate wind field sequences conditioned on building footprint, inlet speed u_in, and domain size L. (c) At inference, the model generates horizontal and vertical velocity fields (u, v) and enables gradient-based inverse optimization of building layouts.

Dataset

We generate 10,000 transient 2D incompressible Euler flow simulations over procedurally generated building layouts. Building footprints consist of randomly placed rectangular blocks within a circular city region whose diameter is sampled from {300, 400, ..., 800} m. Inlet wind speeds are sampled from [0.1, 20] m/s, and wind direction is sampled uniformly from [0, 360]° then canonicalized to left-to-right flow.
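The sampling described above can be sketched as follows. This is a minimal illustration, not the authors' generator: the function name `sample_scenario`, the uniform distributions, the number of blocks, and the block side lengths are all assumptions, and overlap between blocks is not handled.

```python
import numpy as np

# City diameters from the text: {300, 400, ..., 800} m.
DIAMETERS_M = np.arange(300, 900, 100)

def sample_scenario(rng, n_buildings=20):
    """Draw one hypothetical scenario (block count and sizes are assumed)."""
    diameter = float(rng.choice(DIAMETERS_M))
    inlet_speed = float(rng.uniform(0.1, 20.0))    # m/s, assumed uniform
    wind_dir_deg = float(rng.uniform(0.0, 360.0))  # degrees

    # Block centers drawn uniformly over the circular city region;
    # the sqrt makes the density uniform in area rather than in radius.
    radius = diameter / 2.0
    r = radius * np.sqrt(rng.uniform(size=n_buildings))
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n_buildings)
    centers = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)

    # Rectangle side lengths in metres (range is an assumption).
    sizes = rng.uniform(10.0, 40.0, size=(n_buildings, 2))

    return {
        "diameter_m": diameter,
        "inlet_speed_ms": inlet_speed,
        "wind_dir_deg": wind_dir_deg,
        "centers_m": centers,
        "sizes_m": sizes,
    }

scenario = sample_scenario(np.random.default_rng(0))
```

Passing an explicit `numpy.random.Generator` keeps each scenario reproducible from its seed.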

Each frame maps the two velocity components (u, v) to the red and green channels of an RGB image by linearly rescaling with a dataset-wide maximum speed so that values lie in [−1, 1]. The blue channel encodes the fluid mask.
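A minimal sketch of this encoding and its inverse is shown below. The function names are hypothetical, and two details are assumptions: that the rescaling is a plain division by the dataset-wide maximum speed, and that the fluid mask is stored as +1 (fluid) and −1 (building) in the blue channel.

```python
import numpy as np

def encode_frame(u, v, fluid_mask, max_speed):
    """Pack (u, v, mask) into an (H, W, 3) array with values in [-1, 1]."""
    r = np.clip(u / max_speed, -1.0, 1.0)  # red: horizontal velocity
    g = np.clip(v / max_speed, -1.0, 1.0)  # green: vertical velocity
    b = 2.0 * fluid_mask - 1.0             # blue: mask as +/-1 (assumed)
    return np.stack([r, g, b], axis=-1)

def decode_frame(rgb, max_speed):
    """Invert encode_frame: recover velocities in m/s and a boolean mask."""
    u = rgb[..., 0] * max_speed
    v = rgb[..., 1] * max_speed
    fluid_mask = rgb[..., 2] > 0.0
    return u, v, fluid_mask
```

Using one dataset-wide scale, rather than a per-sample scale, means a given pixel value corresponds to the same physical speed in every frame.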

RGB-encoded training samples

Results

VAE Adaptation

The pretrained LTX-Video VAE introduces reconstruction artifacts that degrade physical fidelity. We compare the base VAE against two decoder fine-tuning strategies: a pure MSE loss and a physics-informed loss that additionally penalizes divergence and momentum residuals. The physics-informed variant recovers fine-scale flow structures most faithfully.

Figure panels: Ground Truth; Base VAE (VRMSE: 0.188); MSE Fine-tuned (VRMSE: 0.087); Physics Fine-tuned (VRMSE: 0.070).
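To illustrate what a divergence penalty of this kind can look like, here is a minimal sketch. It is not the authors' loss: the momentum-residual term is omitted, and the central-difference discretization, the weight `weight`, and the function names are assumptions. Arrays are indexed as `[y, x]`.

```python
import numpy as np

def divergence(u, v, dx, dy):
    """Finite-difference divergence du/dx + dv/dy on a [y, x] grid."""
    return np.gradient(u, dx, axis=1) + np.gradient(v, dy, axis=0)

def decoder_loss(u_pred, v_pred, u_true, v_true, fluid_mask, dx, dy, weight=0.1):
    """MSE plus a divergence penalty over fluid cells (weight is assumed)."""
    mse = np.mean((u_pred - u_true) ** 2 + (v_pred - v_true) ** 2)
    div = divergence(u_pred, v_pred, dx, dy)
    # Average the squared divergence over fluid cells only, since the
    # incompressibility constraint does not apply inside buildings.
    n_fluid = max(float(fluid_mask.sum()), 1.0)
    div_penalty = float(np.sum(fluid_mask * div ** 2)) / n_fluid
    return float(mse + weight * div_penalty)
```

For an incompressible flow the true divergence is zero, so the penalty vanishes on an exact reconstruction and grows when the decoder introduces spurious sources or sinks.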

Model Comparison

We compare WinDiNet against purpose-built neural PDE solvers (OFormer, RNO, FNO, AFNO, U-Net, Poseidon) as well as our own ablations: LoRA fine-tuning and text-based conditioning. The best variant, Dec. FT Physics (fine-tuned decoder with physics-informed losses, scalar conditioning), outperforms all baselines on pointwise metrics while generating all 112 frames in a single forward pass in under a second.


Variance-Normalized RMSE (VRMSE) across all models. Grey: baselines, light blue: WinDiNet text-conditioned, dark blue: WinDiNet scalar-conditioned.
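The page does not spell out the VRMSE formula. A common definition, assumed here, divides the mean squared error by the variance of the target field before taking the square root. The function name and the stabilizer `eps` are illustrative.

```python
import numpy as np

def vrmse(pred, target, eps=1e-7):
    """Variance-normalized RMSE under the assumed definition:
    sqrt(mean((pred - target)^2) / (var(target) + eps))."""
    mse = np.mean((pred - target) ** 2)
    var = np.mean((target - np.mean(target)) ** 2)
    return float(np.sqrt(mse / (var + eps)))
```

Under this definition a perfect prediction scores 0, and predicting the target's mean everywhere scores approximately 1, which makes values comparable across fields with different magnitudes.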

Interactive Demo

At inference, the user provides a building footprint, an inlet speed, and a domain size. WinDiNet generates a full 112-frame velocity rollout in under a second. Rotating the building footprint is equivalent to changing the wind direction relative to the buildings. Use the sliders below to explore predictions over an example building layout across different inlet speeds, rotations, and time steps. The output is displayed as wind speed magnitude with a coolwarm colormap for easier interpretation.

Demo controls: Inlet Speed (10 m/s), Rotation (0°), Time Step (with playback).
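The equivalence between rotating the footprint and changing the wind direction can be made concrete with a small sketch. The function names are hypothetical, and the angle convention (direction the wind travels toward, measured counter-clockwise from the +x axis) is an assumption.

```python
import numpy as np

def rotate_points(points, angle_deg, center=(0.0, 0.0)):
    """Rotate 2D points counter-clockwise by angle_deg about center."""
    a = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(a), -np.sin(a)],
                    [np.sin(a),  np.cos(a)]])
    c = np.asarray(center, dtype=float)
    return (np.asarray(points, dtype=float) - c) @ rot.T + c

def canonicalize(points, wind_dir_deg, center=(0.0, 0.0)):
    """Rotate a layout so that wind travelling along wind_dir_deg
    (assumed convention) becomes left-to-right flow along +x."""
    return rotate_points(points, -wind_dir_deg, center)
```

Because the model always sees left-to-right flow, a query for another wind direction only requires rotating the building footprint before rasterizing it.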

Inverse Optimization

Since WinDiNet is end-to-end differentiable, it can serve as a physics simulator for gradient-based inverse optimization of building layouts. Building positions are iteratively updated to minimize a comfort loss that penalizes dangerous gusts, speeds outside the desired comfort range, and stagnant zones in a downstream objective region.
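One way to express the three penalties is with squared hinge terms, sketched below. This is an illustration under assumptions, not the authors' loss: the function name, the thresholds passed as arguments, and the weights are all hypothetical. `speed` has shape `(T, H, W)` and `region_mask` is a boolean `(H, W)` array marking the objective region.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def comfort_loss(speed, region_mask, v_lo, v_hi, v_danger, v_stagnant,
                 w_band=1.0, w_gust=10.0, w_stagnant=1.0):
    """Penalize wind speeds in the objective region (weights are assumed)."""
    s = speed[:, region_mask]  # (T, N): speeds inside the objective region

    # Speeds below or above the desired comfort band [v_lo, v_hi].
    band = relu(v_lo - s) ** 2 + relu(s - v_hi) ** 2
    # Dangerous gusts above v_danger, weighted more heavily.
    gust = relu(s - v_danger) ** 2
    # Stagnant zones where the air barely moves.
    stagnant = relu(v_stagnant - s) ** 2

    return float(np.mean(w_band * band + w_gust * gust + w_stagnant * stagnant))
```

The hinge terms are zero inside the comfort band and grow smoothly outside it, so the loss stays differentiable almost everywhere with respect to the predicted speeds.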

(a) Movable buildings (red) and objective region (green rectangle) are defined. (b) A differentiable rasterizer produces a soft occupancy mask. (c) WinDiNet predicts the wind field. (d) Comfort loss penalizes speeds outside the target band. (e) Gradient descent updates building positions. (f) Optimized layout concentrates speeds within the comfort band.
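Steps (b) and (e) can be sketched in a dependency-free way as follows. Several things here are assumptions made for illustration: the sigmoid-based soft rasterizer, the temperature `tau`, the toy loss (mean occupancy inside a region) that stands in for WinDiNet plus the comfort loss, and the use of central finite differences instead of backpropagation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_occupancy(centers, sizes, X, Y, tau=2.0):
    """Soft mask in [0, 1] for axis-aligned rectangles; smooth in centers."""
    free = np.ones_like(X)
    for (cx, cy), (w, h) in zip(centers, sizes):
        inside = (sigmoid((w / 2 - np.abs(X - cx)) / tau)
                  * sigmoid((h / 2 - np.abs(Y - cy)) / tau))
        free *= 1.0 - inside  # soft union over buildings
    return 1.0 - free

def toy_loss(centers, sizes, X, Y, region):
    """Stand-in objective: mean building occupancy inside the region."""
    return float(np.mean(soft_occupancy(centers, sizes, X, Y)[region]))

def fd_grad(f, centers, h=1e-3):
    """Central finite-difference gradient of f with respect to centers."""
    grad = np.zeros_like(centers)
    for idx in np.ndindex(centers.shape):
        step = np.zeros_like(centers)
        step[idx] = h
        grad[idx] = (f(centers + step) - f(centers - step)) / (2 * h)
    return grad

xs = np.linspace(0.0, 100.0, 64)
X, Y = np.meshgrid(xs, xs)
region = (X > 60) & (X < 90) & (Y > 35) & (Y < 65)
sizes = np.array([[20.0, 20.0]])
centers = np.array([[62.0, 50.0]])  # building starts overlapping the region

objective = lambda c: toy_loss(c, sizes, X, Y, region)
loss_before = objective(centers)
for _ in range(100):
    centers = centers - 50.0 * fd_grad(objective, centers)  # gradient step
loss_after = objective(centers)
```

The soft edges are what make the mask differentiable with respect to building positions. A hard rasterizer would give zero gradient almost everywhere, so the optimizer would have no signal to move the buildings.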

Select a comfort band below to see how the optimizer reshapes the layout for different target conditions.

Demo panels: Target Comfort Band (selector), Initial Layout, Optimized Layout, Initial Footprint, Speed Distribution.

BibTeX

@article{perinibischof2026windinet,
  title     = {Pretrained Video Models as Differentiable Physics Simulators for Urban Wind Flows},
  author    = {Perini, Janne and Bischof, Rafael and Arar, Moab and Duran, Ay\c{c}a and Kraus, Michael A. and Mishra, Siddhartha and Bickel, Bernd},
  journal   = {arXiv preprint arXiv:2603.21210},
  year      = {2026}
}