Designing urban spaces that provide pedestrian wind comfort and safety requires time-resolved Computational Fluid Dynamics (CFD) simulations, but their current computational cost makes extensive design exploration impractical. We introduce WinDiNet (Wind Diffusion Network), a pretrained video diffusion model that is repurposed as a fast, differentiable surrogate for this task. Starting from LTX-Video, a 2B-parameter latent video transformer, we fine-tune on 10,000 2D incompressible CFD simulations over procedurally generated building layouts. A systematic study of training regimes, conditioning mechanisms, and VAE adaptation strategies, including a physics-informed decoder loss, identifies a configuration that outperforms purpose-built neural PDE solvers. The resulting model generates full 112-frame rollouts in under a second. As the surrogate is end-to-end differentiable, it doubles as a physics simulator for gradient-based inverse optimization: given an urban footprint layout, we optimize building positions directly through backpropagation to improve wind safety as well as pedestrian wind comfort. Experiments on single- and multi-inlet layouts show that the optimizer discovers effective layouts even under challenging multi-objective configurations, with all improvements confirmed by ground-truth CFD simulations.
(a) Procedurally generated urban layouts are simulated with a 2D incompressible Euler solver to produce training data. (b) A latent diffusion model with a physics-informed VAE is trained to generate wind field sequences conditioned on building footprint, inlet speed u_in, and domain size L. (c) At inference, the model generates horizontal and vertical velocity fields (u, v) and enables gradient-based inverse optimization of building layouts.
We generate 10,000 transient 2D incompressible Euler flow simulations over procedurally generated building layouts. Building footprints consist of randomly placed rectangular blocks within a circular city region whose diameter is sampled from {300, 400, ..., 800} m. Inlet wind speeds are sampled from [0.1, 20] m/s, and wind direction is sampled uniformly from [0, 360]° then canonicalized to left-to-right flow.
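As a rough illustration of the sampling scheme described above, the following sketch draws one scenario and rotates building centres so that the wind runs left to right. The function names, the uniform distribution for inlet speed, and the angle convention (direction the wind blows toward, counterclockwise from +x) are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

# City diameters {300, 400, ..., 800} m, as stated in the text.
DIAMETERS_M = np.arange(300, 801, 100)


def sample_scenario(rng):
    """Draw one hypothetical scenario (names and distributions are illustrative)."""
    return {
        "diameter_m": float(rng.choice(DIAMETERS_M)),
        # Assumption: inlet speed is drawn uniformly from [0.1, 20] m/s.
        "inlet_speed_ms": float(rng.uniform(0.1, 20.0)),
        "wind_dir_deg": float(rng.uniform(0.0, 360.0)),
    }


def canonicalize(centers_xy, wind_dir_deg):
    """Rotate building centres about the origin so the wind blows along +x.

    Assumption: wind_dir_deg is the direction the wind blows toward,
    measured counterclockwise from the +x axis.
    """
    theta = np.deg2rad(-wind_dir_deg)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    return centers_xy @ rot.T


rng = np.random.default_rng(0)
scenario = sample_scenario(rng)
centers = np.array([[0.0, 1.0], [50.0, -20.0]])
canonical_centers = canonicalize(centers, scenario["wind_dir_deg"])
```

Rotating the layout instead of the inflow keeps a single inlet boundary for every sample, so wind direction does not need to be a separate conditioning input.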
Each frame maps the two velocity components (u, v) to the red and green channels of an RGB image by linearly rescaling with a dataset-wide maximum speed so that values lie in [−1, 1]. The blue channel encodes the fluid mask.
RGB-encoded training samples
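The channel encoding can be sketched in a few lines of NumPy. This is a minimal example under two assumptions that the text does not spell out: the linear rescaling is a division by the dataset-wide maximum speed, and the fluid mask is stored as +1 for fluid and -1 for buildings. The function names are hypothetical.

```python
import numpy as np


def encode_rgb(u, v, fluid_mask, max_speed):
    """Pack (u, v, mask) into a 3-channel image with values in [-1, 1]."""
    r = np.clip(u / max_speed, -1.0, 1.0)   # red: horizontal velocity
    g = np.clip(v / max_speed, -1.0, 1.0)   # green: vertical velocity
    # Assumption: +1 marks fluid cells, -1 marks building cells.
    b = np.where(fluid_mask, 1.0, -1.0)     # blue: fluid mask
    return np.stack([r, g, b], axis=-1).astype(np.float32)


def decode_velocity(rgb, max_speed):
    """Invert the encoding for the two velocity channels."""
    return rgb[..., 0] * max_speed, rgb[..., 1] * max_speed


u = np.array([[10.0, -5.0]])
v = np.array([[0.0, 20.0]])
mask = np.array([[True, False]])
frame = encode_rgb(u, v, mask, max_speed=20.0)
u_back, v_back = decode_velocity(frame, max_speed=20.0)
```

Using one dataset-wide scale, rather than a per-sample scale, keeps the mapping from pixel value to physical speed identical across all simulations.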
The pretrained LTX-Video VAE introduces reconstruction artifacts that degrade physical fidelity. We compare the base VAE against two decoder fine-tuning strategies: a pure MSE loss and a physics-informed loss that additionally penalizes divergence and momentum residuals. The physics-informed variant recovers fine-scale flow structures most faithfully.
VAE reconstruction comparison (panels): Ground Truth; Base VAE (VRMSE: 0.188); MSE Fine-tuned (VRMSE: 0.087); Physics Fine-tuned (VRMSE: 0.070).
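To make the idea of a physics-informed decoder loss more concrete, the sketch below adds a divergence penalty to a plain MSE term. It is only an illustration: it uses central finite differences on a uniform grid, covers the divergence residual alone (the momentum residual mentioned above is omitted), and the weight and function names are assumptions rather than the paper's settings.

```python
import numpy as np


def divergence(u, v, dx, dy):
    """Central-difference du/dx + dv/dy on interior cells.

    Arrays are shaped (H, W): axis 0 is y, axis 1 is x.
    """
    du_dx = (u[1:-1, 2:] - u[1:-1, :-2]) / (2.0 * dx)
    dv_dy = (v[2:, 1:-1] - v[:-2, 1:-1]) / (2.0 * dy)
    return du_dx + dv_dy


def physics_informed_loss(pred_u, pred_v, true_u, true_v, dx, dy, weight=0.1):
    """MSE reconstruction term plus a penalty on non-zero divergence.

    `weight` is a hypothetical hyperparameter, not a value from the paper.
    """
    mse = np.mean((pred_u - true_u) ** 2 + (pred_v - true_v) ** 2)
    div_penalty = np.mean(divergence(pred_u, pred_v, dx, dy) ** 2)
    return float(mse + weight * div_penalty)


# A divergence-free field (u = x, v = -y) and a diverging one (u = x, v = y).
xs = np.arange(5.0)
X, Y = np.meshgrid(xs, xs)
loss_clean = physics_informed_loss(X, -Y, X, -Y, dx=1.0, dy=1.0)
div_bad = divergence(X, Y, dx=1.0, dy=1.0)
```

For an incompressible flow the velocity field should be divergence-free, so the penalty pushes the decoder away from reconstructions that match pixel values but violate mass conservation.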
We compare WinDiNet against purpose-built neural PDE solvers (OFormer, RNO, FNO, AFNO, U-Net, Poseidon) as well as our own ablations: LoRA fine-tuning and text-based conditioning. The best variant, Dec. FT Physics (fine-tuned decoder with physics-informed losses, scalar conditioning), outperforms all baselines on pointwise metrics while generating all 112 frames in a single forward pass in under a second.
Variance-Normalized RMSE (VRMSE) across all models. Grey: baselines, light blue: WinDiNet text-conditioned, dark blue: WinDiNet scalar-conditioned.
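The page does not define VRMSE, so the following is a sketch of the commonly used form: the RMSE divided by the standard deviation of the target field. Which axes the paper averages over and the value of the stabilizing constant `eps` are assumptions here.

```python
import numpy as np


def vrmse(pred, target, eps=1e-7):
    """Variance-normalized RMSE: sqrt(MSE / (Var(target) + eps)).

    Assumption: mean and variance are taken over all elements of `target`.
    """
    mse = np.mean((pred - target) ** 2)
    var = np.mean((target - np.mean(target)) ** 2)
    return float(np.sqrt(mse / (var + eps)))


target = np.array([0.0, 1.0, 2.0, 3.0])
score_perfect = vrmse(target, target)
score_mean = vrmse(np.full_like(target, target.mean()), target)
```

Under this definition a perfect prediction scores 0 and a predictor that always outputs the target mean scores approximately 1, which makes values comparable across fields with different magnitudes.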
At inference, the user provides a building footprint, an inlet speed, and a domain size. WinDiNet generates a full 112-frame velocity rollout in under a second. Rotating the building footprint is equivalent to changing the wind direction relative to the buildings. The interactive demo shows predictions for an example building layout across different inlet speeds, rotations, and time steps. The output is displayed as wind speed magnitude with a coolwarm colormap for easier interpretation.
Since WinDiNet is end-to-end differentiable, it can serve as a physics simulator for gradient-based inverse optimization of building layouts. Building positions are iteratively updated to minimize a comfort loss that penalizes dangerous gusts, speeds outside the desired comfort range, and stagnant zones in a downstream objective region.
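A comfort loss of this kind could look like the sketch below. It is one possible formulation, not the paper's: the squared-hinge penalties, the thresholds `v_min`, `v_max`, and `v_danger`, and the `danger_weight` factor are all hypothetical choices made for illustration.

```python
import numpy as np


def comfort_loss(speed, region_mask, v_min, v_max, v_danger, danger_weight=10.0):
    """Penalize speeds outside [v_min, v_max] inside the objective region.

    speed:        wind speed magnitude, shaped (..., H, W)
    region_mask:  boolean array shaped (H, W) selecting the objective region
    All thresholds and the weight are illustrative assumptions.
    """
    s = speed[..., region_mask]
    below = np.maximum(v_min - s, 0.0) ** 2      # stagnant zones
    above = np.maximum(s - v_max, 0.0) ** 2      # above the comfort band
    danger = np.maximum(s - v_danger, 0.0) ** 2  # dangerous gusts
    return float(np.mean(below + above + danger_weight * danger))


speed = np.array([[1.0, 3.0], [6.0, 12.0]])
region = np.ones((2, 2), dtype=bool)
loss = comfort_loss(speed, region, v_min=2.0, v_max=5.0, v_danger=10.0)
```

The squared hinge is zero inside the comfort band and grows smoothly outside it, so in an automatic-differentiation framework the same expression yields gradients that push speeds back toward the band.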
(a) Movable buildings (red) and objective region (green rectangle) are defined. (b) A differentiable rasterizer produces a soft occupancy mask. (c) WinDiNet predicts the wind field. (d) Comfort loss penalizes speeds outside the target band. (e) Gradient descent updates building positions. (f) Optimized layout concentrates speeds within the comfort band.
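Step (b) relies on an occupancy mask that varies smoothly with building position. One simple way to obtain such a mask for axis-aligned rectangles is to replace the hard inside/outside test with a sigmoid, as sketched below. The construction, the `sharpness` parameter, and the function names are assumptions for illustration; the paper's rasterizer may differ. NumPy is used here only to show the formulas; in practice they would be written in an autodiff framework so that gradients reach the building coordinates.

```python
import numpy as np


def sigmoid(x):
    """Numerically stable logistic function."""
    return 0.5 * (1.0 + np.tanh(0.5 * x))


def soft_box_mask(cx, cy, w, h, grid_x, grid_y, sharpness=2.0):
    """Soft occupancy in (0, 1) for one axis-aligned box centred at (cx, cy)."""
    inside_x = sigmoid(sharpness * (w / 2.0 - np.abs(grid_x - cx)))
    inside_y = sigmoid(sharpness * (h / 2.0 - np.abs(grid_y - cy)))
    return inside_x * inside_y


def rasterize(boxes, grid_x, grid_y, sharpness=2.0):
    """Soft union of boxes: 1 - prod_i (1 - m_i)."""
    free = np.ones_like(grid_x, dtype=float)
    for cx, cy, w, h in boxes:
        free *= 1.0 - soft_box_mask(cx, cy, w, h, grid_x, grid_y, sharpness)
    return 1.0 - free


xs = np.arange(21.0)
grid_x, grid_y = np.meshgrid(xs, xs)  # grid_x[i, j] = xs[j], grid_y[i, j] = xs[i]
occupancy = rasterize([(10.0, 10.0, 8.0, 8.0)], grid_x, grid_y)
```

The mask is close to 1 inside the box, close to 0 far away, and about 0.5 on the edges. Larger `sharpness` values approach a hard mask at the cost of weaker gradients away from the edges.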
Optimization results for a selected comfort band (panels): Initial Layout; Optimized Layout; Initial Footprint; Speed Distribution.
@article{perinibischof2026windinet,
  title   = {Pretrained Video Models as Differentiable Physics Simulators for Urban Wind Flows},
  author  = {Perini, Janne and Bischof, Rafael and Arar, Moab and Duran, Ay\c{c}a and Kraus, Michael A. and Mishra, Siddhartha and Bickel, Bernd},
  journal = {arXiv preprint arXiv:2603.21210},
  year    = {2026}
}