# Training Guide
This guide explains how to launch and monitor a MiCA-BD3LM training run.
## Environment Setup
Ensure the project's virtual environment is activated before launching a run. A minimal example, assuming the environment lives at `.venv` in the repository root:
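```bash
# Path is an assumption; adjust to wherever your virtual environment lives
source .venv/bin/activate
```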
## Starting a Run
We use task-spooler (`tsp`) to queue jobs and `systemd-run` to enforce memory limits on the NVIDIA GB10 system.
```bash
tsp systemd-run --user --scope \
    -p MemoryMax=100G \
    -p MemorySwapMax=0 \
    -p MemoryHigh=90G \
    env WANDB_API_KEY=your_key_here \
        WANDB_ENTITY=your_entity_here \
        WANDB_PROJECT=nano-dllm \
        WANDB_MODE=online \
    python scripts/train.py \
        --output_dir ./outputs/your-run-name \
        --max_steps 20000 \
        --learning_rate 1e-4
```
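Here `MemoryHigh=90G` is the soft threshold at which the kernel starts throttling and reclaiming memory from the job, `MemoryMax=100G` is the hard cap beyond which the job is OOM-killed, and `MemorySwapMax=0` prevents the job from spilling into swap.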
## Key Parameters
* `--mica_rank`: Rank of the adapters (default: `32`).
* `--resume_from_checkpoint`: Path to a checkpoint directory (e.g., `./outputs/run1/checkpoint-1000`) to continue training; see the example below.
* `--report_to`: Set to `"wandb"` for live tracking.
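For instance, a resumed run that logs to W&B might look like this (the run name and checkpoint path are illustrative):

```bash
python scripts/train.py \
    --output_dir ./outputs/your-run-name \
    --resume_from_checkpoint ./outputs/your-run-name/checkpoint-1000 \
    --report_to wandb
```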
## Monitoring
### Terminal
Monitor the job queue and live logs using `tsp`:
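```bash
# List queued, running, and finished jobs
tsp

# Follow the live output of a job (job ID from the first column of the list)
tsp -t <job_id>
```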
### Weights & Biases
If `--report_to wandb` is active, you can view live charts for:
* `train/loss`: Training loss.
* `train/accuracy`: Token-level prediction accuracy at masked positions.
* `gsm8k/accuracy`: Post-checkpoint benchmark results.
* `system/gpu_utilization`: Hardware health.