LLM Backends
The backend layer lives in cyberpaw/agent/backends/ and provides a uniform async interface over local LLM inference engines.
Interface
Section titled “Interface”backends/base.py defines LLMBackend:
class LLMBackend(abc.ABC): async def load(self, model_path: str, on_progress: Callable[[int], None]) -> None: ... def is_loaded(self) -> bool: ... def unload(self) -> None: ... async def generate(self, prompt: str, params: GenerateParams) -> AsyncIterator[str]: ... def count_tokens(self, text: str) -> int: ... def chat_template(self) -> str | None: ... def eos_strings(self) -> list[str]: ... def context_size(self) -> int: ...generate() is an async generator — it yields individual token strings as they are produced by the model.
llama.cpp backend
Section titled “llama.cpp backend”backends/llamacpp_backend.py — wraps llama-cpp-python for GGUF model inference.
Loading: runs in a thread via asyncio.to_thread to keep the event loop responsive. Reports load progress (0–100%) via the on_progress callback.
Inference: runs in a thread pool via loop.run_in_executor. Tokens are passed back to the async generator through an asyncio.Queue.
Chat template: reads the Jinja2 chat template embedded in the GGUF metadata (tokenizer.chat_template). Falls back to a hand-rolled Gemma template if absent.
Stop sequences: reads EOS token IDs from GGUF metadata keys tokenizer.ggml.eos_token_id, tokenizer.ggml.eot_token_id, and tokenizer.ggml.eom_token_id to build a complete stop sequence list.
Auto-selection and context sizing
Section titled “Auto-selection and context sizing”backends/selector.py — calculate_context_size() determines the largest safe context window for the available hardware:
budget_gb = total_RAM_gb × 0.75kv_budget_gb = budget_gb − model_weight_gbraw_n_ctx = kv_budget_gb × 1024³ / kv_bytes_per_tokenn_ctx = round_down_to_power_of_2(raw_n_ctx)n_ctx = clamp(n_ctx, 4096, 65536)KV cache bytes per token for known models:
| Model | Bytes/token |
|---|---|
| Gemma 4 E2B | 106,496 |
| Gemma 4 E4B | 139,264 |
| Unknown | 80,000 (conservative) |
The 65,536 token cap is a hard limit — n_ctx=131072 causes llama_decode errors on Apple Silicon with llama-cpp-python 0.3.x.
GenerateParams
Section titled “GenerateParams”@dataclassclass GenerateParams: max_new_tokens: int = 4096 temperature: float = 1.0 top_p: float = 0.95 repetition_penalty: float = 1.1 stop_sequences: list[str] = ["<end_of_turn>", "</start_of_turn>"]Temperature is set per model family: 0.0 for Gemma (deterministic coding), 0.2 for others.
Prompt rendering
Section titled “Prompt rendering”prompt/model_template.py — renders the full prompt string for each generation step:
- Reads the Jinja2 chat template from the GGUF metadata
- Converts
Messageobjects to HF-style{"role": ..., "content": ...}dicts - Serialises
ToolUseBlockandToolResultBlockas XML inside the content field - Renders with
add_generation_prompt=Trueto append the assistant turn opener - Falls back to the hand-rolled Gemma template if Jinja2 is unavailable or the template fails