data/images/llama-cpp-turboquant-cuda First commit in new repo 2026-05-30 10:34:40 +02:00
tui Import model from pasted YAML text instead of file 2026-06-12 20:38:00 +02:00
.dockerignore First commit in new repo 2026-05-30 10:34:40 +02:00
.gitignore First commit in new repo 2026-05-30 10:34:40 +02:00
AGENTS.md First commit in new repo 2026-05-30 10:34:40 +02:00
llama-runner First commit in new repo 2026-05-30 10:34:40 +02:00
README.md First commit in new repo 2026-05-30 10:34:40 +02:00
requirements.txt First commit in new repo 2026-05-30 10:34:40 +02:00

llama-runner

Terminal UI for running llama.cpp MoE models in containers. It auto-calculates a VRAM budget for expert offloading, handles HuggingFace downloads, and launches each model with parameters derived from that budget.

Quick Start

pip install -r requirements.txt
python -m tui

Or use the launcher script:

./llama-runner

Requirements

  • Python 3.10+
  • NVIDIA GPU with CUDA
  • Docker or Podman (auto-detected)
  • textual>=0.40.0, pyyaml>=6.0, huggingface_hub, requests

What It Does

  • One model at a time on port 8080. Starting a new model auto-stops the previous container.
  • Cache-type-aware VRAM budgeting — calculates --n-cpu-moe based on your GPU VRAM, KV cache type, context size, and model architecture. Targets ~80% VRAM usage by default.
  • HuggingFace integration — download models by repo:quant spec, with resume support. No CLI tools needed.
  • Vision model support — auto-downloads mmproj, sets appropriate batch sizes.
  • Thin containers — only llama-server inside. All launch logic runs on the host.
  • Open WebUI — built-in screen to manage an Open WebUI container alongside models.
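
To make the VRAM budgeting idea concrete, here is a rough sketch. It is not the project's actual calculation: it assumes every MoE layer's expert weights are the same size and that the KV cache size is already known for the chosen cache type and context size. The function name and all parameters are hypothetical. It returns the smallest number of layers whose experts would have to stay on the CPU so that everything else fits within the target fraction of VRAM.

```python
import math


def estimate_n_cpu_moe(
    vram_total_gib: float,
    model_size_gib: float,
    expert_gib_per_layer: float,
    n_layers: int,
    kv_cache_gib: float,
    vram_target: float = 0.80,
) -> int:
    """Estimate a value for --n-cpu-moe (simplified, hypothetical model)."""
    # VRAM we allow ourselves to use, e.g. 80% of the card.
    budget = vram_total_gib * vram_target
    # What we would need if every weight and the KV cache lived on the GPU.
    needed = model_size_gib + kv_cache_gib
    overflow = needed - budget
    if overflow <= 0:
        return 0  # everything fits, no experts need to stay on the CPU
    # Each layer kept on the CPU frees one layer's worth of expert weights.
    layers = math.ceil(overflow / expert_gib_per_layer)
    # Cannot offload more layers than the model has.
    return min(layers, n_layers)
```

A smaller KV cache (a more aggressive cache type or a shorter context) lowers `needed`, which in turn lowers the number of layers that must be offloaded; that is the sense in which the budget is cache-type-aware.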

First Run

  1. Launch the TUI
  2. Press i to open Images, then b to build the default container image (llama-cpp-turboquant:cuda)
  3. Select a model and press d to download it
  4. Press s to start it

Dashboard Keys

Key      Action
s        Start selected model
x        Stop running model
d        Download selected model
c        Configure model (editor screen)
a        Add a new model
delete   Delete model from list
f        Clean up downloaded files
i        Image manager
g        Settings
w        Open WebUI
q        Quit

Container Image

The default image (llama-cpp-turboquant:cuda) builds from data/images/llama-cpp-turboquant-cuda/Dockerfile. It compiles llama-server with CUDA support from the turboquant-kv-cache branch and sets ENTRYPOINT ["llama-server"]. The TUI passes all command-line arguments at runtime.
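
Because the image's entrypoint is llama-server, anything placed after the image name in a `run` command becomes a llama-server argument. The sketch below shows how a host-side launcher might assemble such a command. It is an illustration, not the TUI's actual code: the function names, the `/models` mount point, and the `--gpus all` flag are assumptions (GPU flags vary between Docker and Podman setups).

```python
import shutil
from pathlib import PurePosixPath

IMAGE = "llama-cpp-turboquant:cuda"  # default image name from this README


def detect_runtime() -> str:
    """Return 'docker' or 'podman', whichever is found on PATH first."""
    for name in ("docker", "podman"):
        if shutil.which(name):
            return name
    raise RuntimeError("neither docker nor podman found on PATH")


def build_run_command(
    runtime: str,
    model_path: str,
    container_name: str,
    server_args: list[str],
) -> list[str]:
    """Build the argv for launching llama-server in a container."""
    model = PurePosixPath(model_path)
    return [
        runtime, "run", "--rm", "-d",
        "--name", container_name,
        "--gpus", "all",                     # assumption: Docker-style GPU flag
        "-p", "8080:8080",                   # the single port used by this tool
        "-v", f"{model.parent}:/models:ro",  # assumption: hypothetical mount point
        IMAGE,
        # Everything below is passed to ENTRYPOINT ["llama-server"].
        "--model", f"/models/{model.name}",
        "--host", "0.0.0.0",
        "--port", "8080",
        *server_args,
    ]
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`, with `server_args` carrying flags such as `--ctx-size`, `--n-cpu-moe`, `--cache-type-k`, and `--cache-type-v`.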

Using with Coding Agents

Once a model is running on port 8080, it exposes an OpenAI-compatible API at http://localhost:8080/v1.

Claude Code

Set ANTHROPIC_BASE_URL to point at the local server:

ANTHROPIC_BASE_URL=http://127.0.0.1:8080 claude

Or add to your shell profile for persistence:

export ANTHROPIC_BASE_URL=http://127.0.0.1:8080

OpenCode

OpenCode requires a config file. Add this to ~/.config/opencode/opencode.json (global) or opencode.json in your project root:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "llama.cpp": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama-server (local)",
      "options": {
        "baseURL": "http://127.0.0.1:8080/v1"
      },
      "models": {
        "local-model": {
          "name": "Local MoE Model",
          "limit": {
            "context": 262144,
            "output": 65536
          }
        }
      }
    }
  }
}

Then run /models in OpenCode and select the local model.

API

Example chat completion request against the OpenAI-compatible endpoint:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"local-model","messages":[{"role":"user","content":"Hello"}],"max_tokens":20}'
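
The same request can be made from Python. This is a minimal sketch using only the standard library; the helper names are hypothetical, and it assumes the response follows the usual OpenAI chat completion shape (`choices[0].message.content`).

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"


def build_chat_request(
    prompt: str,
    max_tokens: int = 20,
    model: str = "local-model",
) -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, max_tokens: int = 20) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a model running, `print(chat("Hello"))` prints the reply.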

Config

Stored at ~/.config/llama-runner/config.yaml. Each model has:

  • hf_spec — HuggingFace repo and quant (e.g. unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M)
  • container_name — unique name for the container
  • config.vision — enable multimodal vision support
  • config.n_cpu_moe — override auto-calculated value (None = auto)
  • config.vram_target — target VRAM usage fraction (default 0.80)
  • config.ctx_size — context window (default 262144)
  • config.cache_type_k / cache_type_v — KV cache quantization (default turbo4/turbo3)
  • config.extra_args — additional args appended to llama-server command
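
As an illustration of how these fields might be consumed, the sketch below splits an hf_spec into repo and quant and fills in the documented defaults for a model entry. The helper names and the exact shape of the entry dict are assumptions; the real config.yaml layout may differ. Only defaults stated above are included.

```python
from copy import deepcopy

# Defaults taken from the field list above.
DEFAULTS = {
    "n_cpu_moe": None,        # None = auto-calculate
    "vram_target": 0.80,
    "ctx_size": 262144,
    "cache_type_k": "turbo4",
    "cache_type_v": "turbo3",
}


def parse_hf_spec(spec: str) -> tuple[str, str]:
    """Split 'owner/repo:QUANT' into (repo, quant)."""
    repo, sep, quant = spec.rpartition(":")
    if not sep or not repo or not quant:
        raise ValueError(f"expected 'repo:quant', got {spec!r}")
    return repo, quant


def resolve_model(entry: dict) -> dict:
    """Return a copy of a model entry with config defaults filled in."""
    resolved = deepcopy(entry)
    # Values set by the user override the defaults.
    resolved["config"] = {**DEFAULTS, **(entry.get("config") or {})}
    return resolved
```

For example, an entry that only sets `ctx_size` keeps that value while inheriting the default cache types and VRAM target.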