WSD Block-Size Curriculum

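Training a model to handle bidirectional context is challenging. To stabilize convergence, we apply a Warmup–Stable–Decay (WSD) schedule to the block size: it starts at 1 (pure autoregressive), grows in stages to 1024, and shrinks again near the end of training.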

The Schedule

The block size \(K\) follows a progressive schedule across the 20,000 training steps:

| Phase | Steps | Block Size | Effective Objective |
|---|---|---|---|
| warmup_ar | 0 – 1000 | 1 | Pure Autoregressive |
| warmup_4 | 1001 – 1500 | 4 | Initial Bidirectional |
| warmup_32 | 1501 – 2000 | 32 | Medium Block |
| warmup_128 | 2001 – 2500 | 128 | Large Block |
| warmup_512 | 2501 – 3000 | 512 | Full Sequence (Diffusion) |
| stable | 3001 – 18000 | 1024 | Steady-state Diffusion |
| decay_256 | 18001 – 19000 | 256 | Refinement |
| decay_64 | 19001 – 19500 | 64 | Fine-grained |
| decay_32 | 19501 – 20000 | 32 | High-precision |
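
As an illustration, the schedule can be encoded as a list of phase boundaries and queried by step. This is a minimal sketch, not the project's actual code: the names `SCHEDULE` and `phase_for_step` are hypothetical, and clamping steps beyond 20,000 to the last phase is an assumption.

```python
from bisect import bisect_left

# (phase name, last step of the phase (inclusive), block size),
# copied from the schedule table.
SCHEDULE = [
    ("warmup_ar", 1000, 1),
    ("warmup_4", 1500, 4),
    ("warmup_32", 2000, 32),
    ("warmup_128", 2500, 128),
    ("warmup_512", 3000, 512),
    ("stable", 18000, 1024),
    ("decay_256", 19000, 256),
    ("decay_64", 19500, 64),
    ("decay_32", 20000, 32),
]
_LAST_STEPS = [last for _, last, _ in SCHEDULE]


def phase_for_step(step):
    """Return (phase name, block size) for a training step."""
    if step < 0:
        raise ValueError("step must be non-negative")
    # Index of the first phase whose last step is >= step.
    # Assumption: steps past the end of the table stay in the final phase.
    idx = min(bisect_left(_LAST_STEPS, step), len(SCHEDULE) - 1)
    name, _, block_size = SCHEDULE[idx]
    return name, block_size
```

Because the boundaries are sorted, the lookup is a binary search over nine entries, so calling it at every training step costs essentially nothing.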

Implementation

The scheduler is implemented as a custom TrainerCallback. At each step, the callback:

1. Checks the current step count.
2. Determines the active phase.
3. Updates the trainer.block_size attribute.
4. Logs phase transitions to the console and W&B.
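
The four steps above might look roughly like the following sketch, assuming the Hugging Face `transformers` `TrainerCallback` interface. The class name `BlockSizeCallback` and the `lookup` and `log_fn` parameters are hypothetical. The callback holds its own reference to the trainer because `trainer.block_size` is a custom attribute, and it treats `state.global_step` as the step index used in the schedule, which is also an assumption.

```python
try:
    from transformers import TrainerCallback
except ImportError:  # lets the sketch run without transformers installed
    TrainerCallback = object


class BlockSizeCallback(TrainerCallback):
    """Sets trainer.block_size from a step -> (phase, block size) lookup."""

    def __init__(self, trainer, lookup, log_fn=None):
        self.trainer = trainer    # object that carries the block_size attribute
        self.lookup = lookup      # hypothetical: maps step -> (phase name, block size)
        self.log_fn = log_fn      # optional metrics logger that accepts a dict
        self.current_phase = None

    def on_step_begin(self, args, state, control, **kwargs):
        # 1. Check the current step count.
        step = state.global_step
        # 2. Determine the active phase.
        phase, block_size = self.lookup(step)
        # 3. Update the trainer's block size.
        self.trainer.block_size = block_size
        # 4. Log only when the phase changes.
        if phase != self.current_phase:
            print(f"step {step}: {self.current_phase} -> {phase} "
                  f"(block_size={block_size})")
            if self.log_fn is not None:
                self.log_fn({"block_size": block_size})
            self.current_phase = phase
```

To use it, build a `lookup` function from the table in The Schedule and register the callback with `trainer.add_callback(BlockSizeCallback(trainer, lookup))`. Passing `log_fn=wandb.log` is one way to send the transitions to W&B, assuming a run has already been initialized.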