llama-runner

A terminal UI for running llama.cpp MoE models in containers. It automatically calculates a VRAM budget for expert offloading, handles HuggingFace downloads, and launches models with parameters derived from that budget.

Quick Start

pip install -r requirements.txt
python -m tui

Or use the launcher script:

./llama-runner

Requirements

Python 3.10+
NVIDIA GPU with CUDA
Docker or Podman (auto-detected)
textual>=0.40.0, pyyaml>=6.0, huggingface_hub, requests

What It Does

One model at a time on port 8080. Starting a new model auto-stops the previous container.
Cache-type-aware VRAM budgeting — calculates --n-cpu-moe based on your GPU VRAM, KV cache type, context size, and model architecture. Targets ~80% VRAM usage by default.
HuggingFace integration — download models by repo:quant spec, with resume support. No CLI tools needed.
Vision model support — auto-downloads mmproj, sets appropriate batch sizes.
Thin containers — only llama-server inside. All launch logic runs on the host.
Open WebUI — built-in screen to manage an Open WebUI container alongside models.
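
The VRAM budgeting idea can be pictured with a rough sketch. This README does not show the formula llama-runner actually uses, so everything below is an assumption: the function name, the byte-size inputs, and the simplified model in which non-expert weights and the KV cache always stay on the GPU while expert weights are placed layer by layer until the budget runs out. The KV cache size is taken as an input because the per-element size of each cache type is not given here.

```python
def estimate_n_cpu_moe(
    vram_total_bytes: int,
    n_layers: int,
    expert_bytes_per_layer: int,
    non_expert_bytes: int,
    kv_cache_bytes: int,
    vram_target: float = 0.80,
) -> int:
    """Return how many layers keep their expert weights on the CPU.

    Hypothetical helper for illustration; not the tool's real formula.
    """
    if expert_bytes_per_layer <= 0:
        return 0  # nothing to offload
    # Only a fraction of total VRAM is targeted (0.80 by default).
    budget = vram_total_bytes * vram_target
    # Space left for experts after the always-resident data is placed.
    free_for_experts = budget - non_expert_bytes - kv_cache_bytes
    if free_for_experts <= 0:
        return n_layers  # no room: every layer's experts stay on the CPU
    layers_on_gpu = min(n_layers, int(free_for_experts // expert_bytes_per_layer))
    return n_layers - layers_on_gpu


GIB = 1024**3
# Made-up numbers: 24 GiB card, 40 layers, 0.5 GiB of experts per layer.
n_cpu_moe = estimate_n_cpu_moe(
    vram_total_bytes=24 * GIB,
    n_layers=40,
    expert_bytes_per_layer=GIB // 2,
    non_expert_bytes=2 * GIB,
    kv_cache_bytes=3 * GIB,
)
print(n_cpu_moe)
```

A larger context size or a less aggressive KV cache quantization raises kv_cache_bytes in this model, which leaves less room for experts and pushes the result up.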

First Run

1. Launch the TUI.
2. Press i to open Images, then b to build the default container image (llama-cpp-turboquant:cuda).
3. Select a model and press d to download it.
4. Press s to start it.

Dashboard Keys

Key	Action
`s`	Start selected model
`x`	Stop running model
`d`	Download selected model
`c`	Configure model (editor screen)
`a`	Add a new model
`delete`	Delete model from list
`f`	Clean up downloaded files
`i`	Image manager
`g`	Settings
`w`	Open WebUI
`q`	Quit

Container Image

The default image (llama-cpp-turboquant:cuda) builds from data/images/llama-cpp-turboquant-cuda/Dockerfile. It compiles llama-server with CUDA support from the turboquant-kv-cache branch and sets ENTRYPOINT ["llama-server"]. The TUI passes all command-line arguments at runtime.
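
Because the image's entrypoint is llama-server, anything placed after the image name in the run command becomes a llama-server argument. The sketch below shows how a host-side launcher might assemble such a command. It is not the project's actual code: the function names, the runtime preference order, the /models mount point, the GPU flags for each runtime, and the container and file names in the example are all assumptions.

```python
import shutil
from typing import List, Optional, Sequence


def detect_runtime() -> Optional[str]:
    """Return the first container runtime found on PATH, or None."""
    for name in ("docker", "podman"):  # preference order is an assumption
        if shutil.which(name):
            return name
    return None


def build_run_command(
    runtime: str,
    image: str,
    container_name: str,
    models_dir: str,
    model_file: str,
    n_cpu_moe: int,
    ctx_size: int = 262144,
    cache_type_k: str = "turbo4",
    cache_type_v: str = "turbo3",
    port: int = 8080,
    extra_args: Sequence[str] = (),
) -> List[str]:
    """Build the argv list for starting llama-server in a container."""
    # GPU access differs per runtime; both forms here are assumptions
    # about the host setup (NVIDIA Container Toolkit or CDI).
    if runtime == "docker":
        gpu_flags = ["--gpus", "all"]
    else:
        gpu_flags = ["--device", "nvidia.com/gpu=all"]
    container_opts = [
        runtime, "run", "--rm", "-d",
        "--name", container_name,
        "-p", f"{port}:{port}",
        "-v", f"{models_dir}:/models:ro",  # mount point is hypothetical
        *gpu_flags,
    ]
    server_args = [
        "--model", f"/models/{model_file}",
        "--host", "0.0.0.0",
        "--port", str(port),
        "--ctx-size", str(ctx_size),
        "--cache-type-k", cache_type_k,
        "--cache-type-v", cache_type_v,
        "--n-cpu-moe", str(n_cpu_moe),
        *extra_args,
    ]
    # Everything after the image name is handed to the entrypoint.
    return [*container_opts, image, *server_args]


cmd = build_run_command(
    runtime="docker",
    image="llama-cpp-turboquant:cuda",
    container_name="example-model",   # placeholder name
    models_dir="/path/to/models",     # placeholder path
    model_file="model.gguf",          # placeholder file
    n_cpu_moe=12,
)
print(" ".join(cmd))
```

Keeping the command as a list rather than a single string avoids shell quoting problems when it is passed to subprocess.run.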

Using with Coding Agents

Once a model is running on port 8080, it exposes an OpenAI-compatible API at http://localhost:8080/v1.

Claude Code

Set ANTHROPIC_BASE_URL to point at the local server:

ANTHROPIC_BASE_URL=http://127.0.0.1:8080 claude

Or add to your shell profile for persistence:

export ANTHROPIC_BASE_URL=http://127.0.0.1:8080

OpenCode

OpenCode requires a config file. Add this to ~/.config/opencode/opencode.json (global) or .opencode.json in your project:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "llama.cpp": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama-server (local)",
      "options": {
        "baseURL": "http://127.0.0.1:8080/v1"
      },
      "models": {
        "local-model": {
          "name": "Local MoE Model",
          "limit": {
            "context": 262144,
            "output": 65536
          }
        }
      }
    }
  }
}

Then run /models in OpenCode and select the local model.

API

The server's OpenAI-compatible API can also be called directly, for example with curl:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"local-model","messages":[{"role":"user","content":"Hello"}],"max_tokens":20}'
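
The same request can be sent from Python with only the standard library. This is a minimal sketch, assuming a model is already running on port 8080 and that the response follows the usual OpenAI chat completion shape; the helper names build_chat_request and chat are illustrative and not part of llama-runner.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    base_url: str = "http://localhost:8080/v1",
    model: str = "local-model",
    max_tokens: int = 20,
) -> urllib.request.Request:
    """Build a POST request equivalent to the curl example."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the prompt and return the first choice's message text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Requires a running model, so it is left commented out here:
# print(chat("Hello"))
```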

Config

Configuration is stored at ~/.config/llama-runner/config.yaml. Each model entry has:

hf_spec — HuggingFace repo and quant (e.g. unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M)
container_name — unique name for the container
config.vision — enable multimodal vision support
config.n_cpu_moe — override auto-calculated value (None = auto)
config.vram_target — target VRAM usage fraction (default 0.80)
config.ctx_size — context window (default 262144)
config.cache_type_k / cache_type_v — KV cache quantization (default turbo4/turbo3)
config.extra_args — additional args appended to llama-server command
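
To make the fields above concrete, here is a small sketch of how the per-model settings and the hf_spec string could be represented in Python. It is not the project's code: the class and function names are hypothetical, the default for vision and the list type for extra_args are assumptions, and the exact YAML layout of config.yaml is not shown in this README, so the sketch works on an already-parsed mapping.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional, Tuple


@dataclass
class ModelSettings:
    """Per-model settings with the defaults listed above."""
    vision: bool = False                 # default assumed, not stated
    n_cpu_moe: Optional[int] = None      # None means auto-calculate
    vram_target: float = 0.80
    ctx_size: int = 262144
    cache_type_k: str = "turbo4"
    cache_type_v: str = "turbo3"
    extra_args: List[str] = field(default_factory=list)  # list form assumed


def parse_hf_spec(spec: str) -> Tuple[str, str]:
    """Split 'repo:quant' into (repo, quant)."""
    repo, sep, quant = spec.rpartition(":")
    if not (sep and repo and quant):
        raise ValueError(f"expected 'repo:quant', got {spec!r}")
    return repo, quant


def settings_from_mapping(raw: dict) -> ModelSettings:
    """Apply overrides from a parsed config mapping, ignoring unknown keys."""
    known = {f.name for f in fields(ModelSettings)}
    return ModelSettings(**{k: v for k, v in raw.items() if k in known})


repo, quant = parse_hf_spec("unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M")
settings = settings_from_mapping({"ctx_size": 65536, "vision": True})
print(repo, quant, settings.ctx_size, settings.vram_target)
```

Any key left out of the mapping falls back to its default, so a model entry only needs to list the values it overrides.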