nano-dllm Developer Documentation
Welcome to the technical documentation for nano-dllm. This project implements a high-efficiency training recipe for converting pretrained autoregressive (AR) language models into Block Diffusion Language Models (BD3LM).
Project Vision
The goal of nano-dllm is to bridge the gap between fast, causal generation and the high-quality, non-autoregressive refinement capabilities of diffusion models. Using Minor Component Adaptation (MiCA), we adapt pretrained models (Qwen3-0.6B in this project) with minimal parameter overhead while preserving the knowledge they acquired during causal pretraining; a minimal sketch of the idea follows.
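The sketch below only illustrates the minor-component idea, assuming a hypothetical `MinorComponentAdapter` module: the class name, the `rank` parameter, and the exact SVD split are illustrative, not the project's actual MiCA implementation.

```python
import torch
import torch.nn as nn

class MinorComponentAdapter(nn.Module):
    """Illustrative sketch: fine-tune only the smallest singular directions of a
    frozen pretrained weight matrix (hypothetical module, not the project's API)."""

    def __init__(self, weight: torch.Tensor, rank: int = 16):
        super().__init__()
        # Factor the frozen pretrained weight: W = U @ diag(S) @ Vh.
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        # Major components (largest singular values) stay frozen.
        self.register_buffer(
            "W_major", U[:, :-rank] @ torch.diag(S[:-rank]) @ Vh[:-rank, :]
        )
        # Minor components (smallest singular values) become trainable.
        self.U_minor = nn.Parameter(U[:, -rank:].clone())
        self.S_minor = nn.Parameter(S[-rank:].clone())
        self.Vh_minor = nn.Parameter(Vh[-rank:, :].clone())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Recombine frozen major part with the trainable minor part.
        weight = self.W_major + self.U_minor @ torch.diag(self.S_minor) @ self.Vh_minor
        return x @ weight.T
```

In practice such an adapter would wrap the linear projections of the frozen backbone, so only the `rank` smallest singular directions receive gradients during fine-tuning.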
Core Components
- MiCA PEFT: Parameter-efficient fine-tuning that targets the minor singular directions of weight matrices.
- BD3LM: A block-wise masked diffusion objective that gives the model bidirectional context within each block while blocks are still processed autoregressively (see the masking sketch after this list).
- WSD Curriculum: A Warmup–Stable–Decay scheduler that progressively increases block sizes to stabilize training.
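To make the block-wise objective concrete, here is a hedged sketch of how a training batch might be noised, assuming a hypothetical `block_diffusion_batch` helper and a simplified uniform noise schedule; the names, the masking policy, and the ignore-index convention are assumptions, not the project's actual data pipeline.

```python
import torch

def block_diffusion_batch(input_ids: torch.Tensor, block_size: int,
                          mask_token_id: int):
    """Illustrative block-wise masking for a (batch, seq_len) tensor of token ids.
    Within each block, a random subset of tokens is replaced by the mask token;
    only those positions contribute to the denoising loss, while the surviving
    tokens act as bidirectional context for that block."""
    noisy = input_ids.clone()
    labels = torch.full_like(input_ids, -100)  # -100 = ignore index for cross-entropy
    batch, seq_len = input_ids.shape
    for start in range(0, seq_len, block_size):
        end = min(start + block_size, seq_len)
        block = input_ids[:, start:end]
        # Sample a per-example noise level t ~ U(0, 1) for this block, then mask
        # each token independently with probability t (simplified schedule).
        t = torch.rand(batch, 1, device=input_ids.device)
        masked = torch.rand(batch, end - start, device=input_ids.device) < t
        noisy[:, start:end] = torch.where(
            masked, torch.full_like(block, mask_token_id), block
        )
        labels[:, start:end] = torch.where(masked, block, torch.full_like(block, -100))
    return noisy, labels
```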
Recent Enhancements
- Memory Optimizations: Integrated gradient checkpointing and dynamic micro-batching to keep attention's quadratic memory growth within budget on NVIDIA GB10 hardware (a micro-batching sketch follows this list).
- Automated Benchmarking: A post-checkpoint callback that automatically evaluates model performance on GSM8K across multiple configurations (AR baseline, Zero-MiCA, and trained adapters).
- Weights & Biases Integration: Live tracking of training loss, token-level accuracy, and benchmark results.
Quick Links
- Training Guide: How to start and monitor a training run.
- GSM8K Protocol: Understanding our evaluation metrics.
- Resource Management: Optimizing for the Blackwell (GB10) architecture.