WSD Block-Size Curriculum
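Training a model to handle bidirectional context is challenging. To stabilize convergence, we apply a Warmup–Stable–Decay (WSD) schedule to the block size: training starts purely autoregressive, the block size grows in stages to its steady-state value, and it shrinks again near the end of training.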
The Schedule
The block size \(K\) follows a progressive schedule across the 20,000 training steps:
| Phase | Steps | Block Size | Effective Objective |
|---|---|---|---|
| `warmup_ar` | 0 – 1000 | 1 | Pure Autoregressive |
| `warmup_4` | 1001 – 1500 | 4 | Initial Bidirectional |
| `warmup_32` | 1501 – 2000 | 32 | Medium Block |
| `warmup_128` | 2001 – 2500 | 128 | Large Block |
| `warmup_512` | 2501 – 3000 | 512 | Full Sequence (Diffusion) |
| `stable` | 3001 – 18000 | 1024 | Steady-state Diffusion |
| `decay_256` | 18001 – 19000 | 256 | Refinement |
| `decay_64` | 19001 – 19500 | 64 | Fine-grained |
| `decay_32` | 19501 – 20000 | 32 | High-precision |
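
The table can be read as a piecewise-constant function from step to block size. The following is a minimal sketch of that lookup, not the project's actual code: the names `SCHEDULE` and `phase_for_step` are hypothetical, and the boundaries are copied from the table with both ends of each range treated as inclusive.

```python
from bisect import bisect_left

# (phase name, last step of the phase inclusive, block size), copied from the table.
SCHEDULE = [
    ("warmup_ar", 1000, 1),
    ("warmup_4", 1500, 4),
    ("warmup_32", 2000, 32),
    ("warmup_128", 2500, 128),
    ("warmup_512", 3000, 512),
    ("stable", 18000, 1024),
    ("decay_256", 19000, 256),
    ("decay_64", 19500, 64),
    ("decay_32", 20000, 32),
]
_LAST_STEPS = [last for _, last, _ in SCHEDULE]


def phase_for_step(step: int) -> tuple[str, int]:
    """Return (phase name, block size) for a training step in [0, 20000]."""
    if step < 0 or step > _LAST_STEPS[-1]:
        raise ValueError(f"step {step} is outside the schedule")
    # bisect_left returns the index of the first phase whose last step is >= step.
    name, _, block_size = SCHEDULE[bisect_left(_LAST_STEPS, step)]
    return name, block_size
```

For example, `phase_for_step(1000)` falls in `warmup_ar` (block size 1), while `phase_for_step(1001)` is the first step of `warmup_4` (block size 4).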
Implementation
The scheduler is implemented as a custom `TrainerCallback`. At each step, the callback:
1. Checks the current step count.
2. Determines the active phase.
3. Updates the `trainer.block_size` attribute.
4. Logs phase transitions to the console and W&B.
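
The four steps above can be sketched as a Hugging Face `TrainerCallback` subclass. This is an illustration under assumptions rather than the actual implementation: the class name `BlockSizeCallback`, passing the trainer and the schedule into the constructor, and the use of the standard `logging` module in place of W&B are all hypothetical choices. The import fallback is only there so the sketch runs without `transformers` installed.

```python
import logging

try:
    from transformers import TrainerCallback  # Hugging Face base class
except ImportError:  # fallback so the sketch still runs without transformers
    TrainerCallback = object

logger = logging.getLogger(__name__)


class BlockSizeCallback(TrainerCallback):
    """Hypothetical callback that sets trainer.block_size from a step schedule."""

    def __init__(self, trainer, schedule):
        # Assumption: callbacks are not handed the trainer, so we keep a reference.
        # schedule: non-empty list of (phase name, last step inclusive, block size),
        # sorted by last step.
        self.trainer = trainer
        self.schedule = schedule
        self.phase = None

    def on_step_begin(self, args, state, control, **kwargs):
        step = state.global_step  # 1. check the current step count
        # 2. pick the first phase whose last step has not been passed;
        #    past the end of the schedule, the final phase stays active.
        for name, last_step, block_size in self.schedule:
            if step <= last_step:
                break
        self.trainer.block_size = block_size  # 3. update the attribute
        if name != self.phase:  # 4. log only when the phase changes
            logger.info("step %d: phase %s, block size %d", step, name, block_size)
            self.phase = name
```

With a real `Trainer`, such a callback could be registered via `trainer.add_callback(BlockSizeCallback(trainer, schedule))`; a `wandb.log` call could sit next to the `logger.info` line to mirror the transition in W&B.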