Diffusion Models

Diffusion models are a class of generative models that progressively transform simple noise into complex data distributions, such as images or climate fields. Intuitively, they work in two phases:

Forward diffusion: A clean signal is gradually corrupted by adding Gaussian noise, eventually transforming it into a nearly pure Gaussian distribution.
Reverse denoising: A neural network is trained to gradually remove the noise, step by step, reconstructing the original data distribution from the noisy signal.

This can be visualized as follows:

_images/diffusion.png — Two original Gaussian distributions are progressively transformed into normal distribution. A denoising network then reconstructs the original distributions.

The framework implements several diffusion formulations commonly used in state-of-the-art generative modeling:

VE (Variance Exploding)
VP (Variance Preserving)
EDM (Elucidated Diffusion Models)
iDDPM (Improved DDPM)

These formulations are selectable via configuration and can be paired with different neural architectures.

EDM Preconditioning

The EDM preconditioned model stabilizes training by standardizing the scales of inputs, outputs, and targets across varying noise levels:

\[D_\theta(\mathbf{x}; \sigma) = c_{\rm {skip}}(\sigma) \mathbf{x} + c_{\rm{out}}(\sigma) F_\theta\big(c_{\rm{in}}(\sigma) \mathbf{x}; c_\mathrm{noise}(\sigma)\big)\]

Where: - \(\mathbf{x}=\mathbf{y}+\sigma\mathbf{n}\) is the noisy input - \(\mathbf{y}\) is the clean signal - \(\mathbf{n} \sim \mathcal{N}(\mathbf{0}, \mathbf{1})\) is standard Gaussian noise - \(\sigma\) is the noise level - \(F_\theta\) is the underlying neural network

Coefficients:

\[c_\mathrm{in}(\sigma) = 1 / (\sigma_\mathrm{data}^2 + \sigma^2)^{1/2}\]

\[c_\mathrm{skip}(\sigma) = \sigma_\mathrm{data}^2 / (\sigma_\mathrm{data}^2 + \sigma^2)\]

\[c_\mathrm{out}(\sigma) = \sigma \, \sigma_\mathrm{data} / (\sigma_\mathrm{data}^2 + \sigma^2)^{1/2}\]

\[c_\mathrm{noise}(\sigma) =1/4 \log \sigma\]

Loss Function

For each training sample, Gaussian noise \(\sigma \mathbf{n}\) with a randomly selected noise level \(\sigma\) is added to the image. The network is trained with weighted denoising loss:

\[\mathcal{L} = \mathbb{E}_{\sigma, \mathbf{y}, \mathbf{n}} \left[ \lambda(\sigma) \left\| D_{\theta}(\mathbf{y} + \sigma\mathbf{n}, \sigma) - \mathbf{y} \right\|_2^2 \right]\]

Where \(\lambda(\sigma) = (\sigma^2 + \sigma_{\mathrm{data}}^2) / (\sigma \, \sigma_{\mathrm{data}})^2\).

Sampling

High-resolution samples are generated by numerically solving the reverse-time stochastic differential equation (SDE):

Initialize with Gaussian noise \(\mathbf{x}_0 \sim \mathcal{N}(\mathbf{0}, t_0^2 \mathbf{1})\)
For each step \(i\) from 0 to \(N-1\): - Optionally add temporary noise increment - Compute denoising direction - Update latent with Euler/Heun scheme
Return final denoised sample

Theoretical Background

For theoretical background, see:

Implementation Details

Each diffusion formulation is implemented as a separate class with:

Noise scheduling: Defines \(\sigma(t)\) progression
Sampling methods: Different ODE/SDE solvers
Loss computation: Formulation-specific weighting
Conditioning: Support for various conditioning strategies

Configuration Example

diffusion:
  type: "EDM"
  sigma_data: 1.0
  sigma_min: 0.002
  sigma_max: 80.0
  rho: 7.0
  p_mean: -1.2
  p_std: 1.2

sampling:
  steps: 40
  sampler: "heun"
  s_churn: 40.0
  s_min: 0.05
  s_max: 50.0
  s_noise: 1.003

Comparison of Formulations

VE: Simple, stable, good for continuous data
VP: Common in image generation, well-studied
EDM: State-of-the-art, excellent sample quality
iDDPM: Improved training stability and sample quality