Training Strategy
=================

Training is performed **globally** using a **random block strategy**, in which
spatial blocks are sampled at random locations across the globe. This strategy:

- Enables scalable training on very large climate datasets
- Reduces the memory footprint while preserving global coverage
- Improves generalization across regions

This design allows a single model to learn global dynamics while remaining usable
for regional inference.

Random Block Sampling
---------------------

During each training epoch, :math:`s` spatial blocks of size :math:`144\times360`
are generated, with block centers placed randomly. Longitude is treated as
periodic, so a block can wrap around the east-west edge of the grid, while
latitude is constrained so that each block stays within the valid global range.

.. figure:: ../../images/random_block_sampling.png
   :width: 80%
   :align: center

   Example of randomly sampled spatial blocks used during training.

Three values for the number of spatial blocks per epoch were evaluated
(:math:`s = 6`, :math:`9`, and :math:`12`). Using :math:`s = 12` blocks was
identified as an effective balance between computational efficiency and spatial
diversity.
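
The sampling step can be sketched as follows. This is a minimal illustration and
not the project's implementation: the function names ``sample_block_indices`` and
``extract_block`` are hypothetical, and the :math:`721\times1440` grid (the
regular 0.25° ERA5 grid) is an assumption made only for the example.

.. code-block:: python

   import numpy as np


   def sample_block_indices(n_lat, n_lon, block_shape=(144, 360), n_blocks=12, rng=None):
       """Return a list of (rows, cols) index arrays, one pair per random block."""
       rng = np.random.default_rng() if rng is None else rng
       b_lat, b_lon = block_shape
       half_lat, half_lon = b_lat // 2, b_lon // 2

       # Latitude: restrict the centers so that every block fits inside the grid.
       lat_centers = rng.integers(half_lat, n_lat - (b_lat - half_lat) + 1, size=n_blocks)
       # Longitude: any center is valid because the axis is periodic.
       lon_centers = rng.integers(0, n_lon, size=n_blocks)

       blocks = []
       for c_lat, c_lon in zip(lat_centers, lon_centers):
           rows = np.arange(c_lat - half_lat, c_lat - half_lat + b_lat)
           # The modulo wraps columns that cross the east-west edge of the grid.
           cols = np.arange(c_lon - half_lon, c_lon - half_lon + b_lon) % n_lon
           blocks.append((rows, cols))
       return blocks


   def extract_block(field, rows, cols):
       """Cut one block out of a 2-D (lat, lon) field."""
       return field[np.ix_(rows, cols)]


   # Example: 12 blocks from an assumed 721 x 1440 global grid.
   field = np.zeros((721, 1440))
   blocks = sample_block_indices(*field.shape, rng=np.random.default_rng(0))
   patch = extract_block(field, *blocks[0])  # shape (144, 360)

Passing a seeded generator, as in the example, makes the block locations
repeatable from run to run.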

Coarse-Down-Up Procedure
------------------------

A coarse-down-up procedure based on bilinear interpolation is used to separate
large-scale and fine-scale components:

1. **Coarsen**: The high-resolution field :math:`\mathbf{y}^{\mathrm{HR}}` is
   reduced to :math:`16\times32` resolution
2. **Upscale**: The coarse field is interpolated back to the original
   resolution, yielding :math:`\mathbf{y}^{\mathrm{CU}}`
3. **Residual**: The fine-scale information
   :math:`\mathbf{R} = \mathbf{y}^{\mathrm{HR}} - \mathbf{y}^{\mathrm{CU}}`
   serves as the training target
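
The three steps can be illustrated with a small NumPy sketch. It is written
under stated assumptions and is not the project's code: ``bilinear_resize`` and
``coarse_down_up`` are hypothetical names, interpolation is done between
cell centers, values outside the outermost centers are held constant, and no
periodic treatment of longitude is applied.

.. code-block:: python

   import numpy as np


   def bilinear_resize(field, out_shape):
       """Resize a 2-D array by linear interpolation along each axis in turn."""
       n_rows, n_cols = field.shape
       out_rows, out_cols = out_shape
       # Cell-center coordinates of the source and target grids on [0, 1].
       src_r = (np.arange(n_rows) + 0.5) / n_rows
       src_c = (np.arange(n_cols) + 0.5) / n_cols
       dst_r = (np.arange(out_rows) + 0.5) / out_rows
       dst_c = (np.arange(out_cols) + 0.5) / out_cols
       # Interpolate down each column, then along each row.
       tmp = np.stack([np.interp(dst_r, src_r, field[:, j]) for j in range(n_cols)], axis=1)
       return np.stack([np.interp(dst_c, src_c, tmp[i, :]) for i in range(out_rows)], axis=0)


   def coarse_down_up(y_hr, coarse_shape=(16, 32)):
       """Return the coarse-up field y_cu and the residual R = y_hr - y_cu."""
       y_coarse = bilinear_resize(y_hr, coarse_shape)  # step 1: coarsen
       y_cu = bilinear_resize(y_coarse, y_hr.shape)    # step 2: upscale
       residual = y_hr - y_cu                          # step 3: residual
       return y_cu, residual


   y_hr = np.random.default_rng(0).normal(size=(144, 360))
   y_cu, residual = coarse_down_up(y_hr)

By construction, ``y_cu + residual`` recovers ``y_hr``, so a predicted residual
can be added back to the coarse-up field to obtain a high-resolution estimate.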

Conditioning Inputs
-------------------

The model is conditioned on:

1. **Coarse-up fields**: Low-resolution approximations :math:`\mathbf{y}^{\mathrm{CU}}` of the high-resolution fields
2. **Geographical variables**: Latitude, longitude, topography (:math:`z`), land-sea mask (LSM)
3. **Temporal information**: Cosine-sine representations of day of year and hour of day
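
A cosine-sine encoding maps a periodic quantity onto the unit circle, so that
23:00 and 00:00, or 31 December and 1 January, end up close together. The sketch
below shows one possible form; the function name ``cyclic_time_features`` and
the year length of 365.25 days are assumptions made for illustration.

.. code-block:: python

   import math
   from datetime import datetime


   def cyclic_time_features(t):
       """Return (cos_day, sin_day, cos_hour, sin_hour) for a datetime."""
       # Day of year starts at 1, so subtract 1 to place 1 January at angle 0.
       day_angle = 2.0 * math.pi * (t.timetuple().tm_yday - 1) / 365.25
       hour_angle = 2.0 * math.pi * (t.hour + t.minute / 60.0) / 24.0
       return (
           math.cos(day_angle),
           math.sin(day_angle),
           math.cos(hour_angle),
           math.sin(hour_angle),
       )


   features = cyclic_time_features(datetime(2020, 7, 1, 18, 0))

Each of the four values can then be broadcast to a constant 2-D field and
stacked with the other conditioning channels.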

.. figure:: ../../images/workflow.png
   :width: 100%
   :align: center

   Workflow of IPSL-AID's training process.

Training Schedule
-----------------

- **Dataset**: ERA5 2015-2019 (train), 2020 (validation), 2021 (test)
- **Batch size**: 80 (optimized for 4× NVIDIA A100 64 GB)
- **Epochs**: 100
- **Optimizer**: Adam with learning rate scheduling
- **Validation**: Every epoch on held-out year

Computational Requirements
--------------------------

- **GPUs**: 4× NVIDIA A100 (64 GB each, 256 GB in total)
- **Time**: ~6 days for full training
- **Memory**: ~200 GB of GPU memory during training
- **Storage**: Sufficient space for datasets and checkpoints

Hyperparameter Tuning
---------------------

Key hyperparameters:

1. **Learning rate**: Typically between :math:`10^{-4}` and :math:`10^{-3}`
2. **Batch size**: Limited by GPU memory, typically 32 to 128
3. **Block size**: :math:`144\times360` provides a good trade-off between memory footprint and spatial coverage
4. **Number of blocks**: 12 per epoch for global coverage
5. **Weight decay**: :math:`10^{-6}` for regularization
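
One way to keep these values together is a small configuration object. The
sketch below is hypothetical: ``TrainingConfig`` and its field names are not
taken from the code base, and the default learning rate of :math:`10^{-4}` is
an assumed value from the typical range rather than a documented setting.

.. code-block:: python

   from dataclasses import asdict, dataclass


   @dataclass(frozen=True)
   class TrainingConfig:
       learning_rate: float = 1e-4   # assumed; only a typical range is documented
       batch_size: int = 80
       epochs: int = 100
       block_shape: tuple = (144, 360)
       blocks_per_epoch: int = 12
       weight_decay: float = 1e-6

       def __post_init__(self):
           # Basic sanity checks so that typos fail early.
           if self.learning_rate <= 0 or self.weight_decay < 0:
               raise ValueError("learning_rate must be > 0 and weight_decay >= 0")
           if min(self.batch_size, self.epochs, self.blocks_per_epoch) < 1:
               raise ValueError("batch_size, epochs and blocks_per_epoch must be >= 1")


   config = TrainingConfig()
   config_dict = asdict(config)  # plain dict, convenient for logging
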

Monitoring and Logging
----------------------

- **Loss curves**: Training and validation loss
- **Metrics**: MAE, RMSE, and R² on the validation set
- **Visualizations**: Sample predictions during training
- **Checkpoints**: The best model and snapshots at regular intervals are saved
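
For reference, the three metrics can be computed as in the sketch below. It is
illustrative only: ``regression_metrics`` is a hypothetical helper, and the
metrics are unweighted, so no latitude-dependent area weighting is applied.

.. code-block:: python

   import numpy as np


   def regression_metrics(y_true, y_pred):
       """Return unweighted MAE, RMSE and R² as a dictionary."""
       y_true = np.asarray(y_true, dtype=float)
       y_pred = np.asarray(y_pred, dtype=float)
       err = y_pred - y_true
       ss_res = float(np.sum(err ** 2))                        # residual sum of squares
       ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))   # total sum of squares
       return {
           "mae": float(np.mean(np.abs(err))),
           "rmse": float(np.sqrt(np.mean(err ** 2))),
           # R² is undefined when the reference field is constant.
           "r2": 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan"),
       }


   metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
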

Early Stopping
--------------

Training stops when the validation loss has not improved for a specified number
of consecutive epochs (typically 10 to 20).
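
The stopping rule can be expressed as a patience counter. The class below is a
minimal sketch; the name ``EarlyStopping`` and the ``min_delta`` threshold are
assumptions made for illustration and are not taken from the code base.

.. code-block:: python

   class EarlyStopping:
       """Signal a stop after `patience` epochs without improvement."""

       def __init__(self, patience=10, min_delta=0.0):
           self.patience = patience
           self.min_delta = min_delta
           self.best = float("inf")
           self.bad_epochs = 0

       def step(self, val_loss):
           """Record one validation loss; return True when training should stop."""
           if val_loss < self.best - self.min_delta:
               self.best = val_loss
               self.bad_epochs = 0   # improvement resets the counter
           else:
               self.bad_epochs += 1
           return self.bad_epochs >= self.patience


   stopper = EarlyStopping(patience=2)
   decisions = [stopper.step(loss) for loss in [1.0, 0.9, 0.95, 0.92]]

In a training loop, ``step`` would be called once per epoch after validation,
and the loop would exit as soon as it returns ``True``.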

Multi-GPU Training
------------------

- **Data Parallel**: Each batch is split across GPUs
- **Model Parallel**: The model itself is split across GPUs (for very large models)
- **Distributed Data Parallel**: Each process holds a model replica, and gradients are synchronized across GPUs and nodes
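
The idea behind the data-parallel variants can be checked numerically: when a
batch is split into equally sized shards, the average of the per-shard gradients
equals the full-batch gradient. The NumPy sketch below uses a toy linear model
with a mean-squared-error loss as a stand-in for the real network; all names in
it are hypothetical.

.. code-block:: python

   import numpy as np


   def mse_gradient(w, x, y):
       """Gradient of mean((x @ w - y)**2) with respect to w."""
       residual = x @ w - y
       return 2.0 * x.T @ residual / len(y)


   rng = np.random.default_rng(0)
   x = rng.normal(size=(80, 5))   # one batch of 80 samples
   y = rng.normal(size=80)
   w = rng.normal(size=5)

   full_grad = mse_gradient(w, x, y)

   # Split the batch into 4 equal shards, one per (simulated) GPU.
   shard_grads = [
       mse_gradient(w, xs, ys)
       for xs, ys in zip(np.split(x, 4), np.split(y, 4))
   ]
   averaged_grad = np.mean(shard_grads, axis=0)

The equality holds only for equally sized shards and a loss that is a mean over
samples, which is the usual setting for synchronized data-parallel training.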

Reproducibility
---------------

- **Random seeds**: Fixed for every run
- **Configuration saving**: The full configuration is saved with each run
- **Version control**: Code and environment specifications are tracked
- **Checkpointing**: Model states are saved at regular intervals
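
The first two points can be sketched as follows. The helpers ``set_seed`` and
``save_config`` are hypothetical, the file name ``config.json`` is an assumed
convention, and PyTorch is seeded only if it happens to be installed.

.. code-block:: python

   import json
   import os
   import random

   import numpy as np


   def set_seed(seed):
       """Seed the Python, NumPy and (if available) PyTorch random generators."""
       random.seed(seed)
       np.random.seed(seed)
       try:
           import torch
       except ImportError:
           return
       torch.manual_seed(seed)
       torch.cuda.manual_seed_all(seed)


   def save_config(config, run_dir):
       """Write the run configuration next to the run's other outputs."""
       os.makedirs(run_dir, exist_ok=True)
       path = os.path.join(run_dir, "config.json")
       with open(path, "w", encoding="utf-8") as f:
           json.dump(config, f, indent=2, sort_keys=True)
       return path


   set_seed(42)

Fixed seeds make random draws repeatable, but bitwise-identical results on GPUs
may additionally depend on library versions and on deterministic kernel settings.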