llama-runner
Terminal UI for running llama.cpp MoE models in containers. Auto-calculates VRAM budgeting for expert offloading, handles HuggingFace downloads, and launches models with optimal parameters.
Quick Start
pip install -r requirements.txt
python -m tui
Or use the launcher script:
./llama-runner
Requirements
- Python 3.10+
- NVIDIA GPU with CUDA
- Docker or Podman (auto-detected)
- Python packages: textual>=0.40.0, pyyaml>=6.0, huggingface_hub, requests
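As an illustration of what runtime auto-detection can look like, here is a minimal sketch. It is not taken from the project: the function name `detect_runtime` and the preference order (Docker before Podman) are assumptions.

```python
import shutil
from typing import Callable, Optional, Sequence


def detect_runtime(
    candidates: Sequence[str] = ("docker", "podman"),
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the first container CLI found on PATH, or None.

    The candidate order is an assumption; the real tool may prefer Podman.
    `which` is injectable so the lookup can be tested without either CLI.
    """
    for name in candidates:
        if which(name) is not None:
            return name
    return None


if __name__ == "__main__":
    print(detect_runtime() or "no container runtime found")
```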
What It Does
- One model at a time on port 8080. Starting a new model auto-stops the previous container.
- Cache-type-aware VRAM budgeting: calculates `--n-cpu-moe` based on your GPU VRAM, KV cache type, context size, and model architecture. Targets ~80% VRAM usage by default.
- HuggingFace integration: download models by `repo:quant` spec, with resume support. No CLI tools needed.
- Vision model support: auto-downloads mmproj, sets appropriate batch sizes.
- Thin containers: only `llama-server` inside. All launch logic runs on the host.
- Open WebUI: built-in screen to manage an Open WebUI container alongside models.
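The VRAM budgeting idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the calculation the TUI actually performs: the function names, the split into "dense" and "per-layer expert" weights, and the omission of compute buffers are all assumptions. The byte sizes for `f16`, `q8_0`, and `q4_0` follow the standard ggml block layouts; sizes for the `turbo3`/`turbo4` cache types are not included and would have to be supplied.

```python
# Bytes per cached element for standard ggml types
# (q8_0 stores 32 elements in 34 bytes, q4_0 stores 32 elements in 18 bytes).
# turbo3/turbo4 are specific to the turboquant branch and are NOT listed here.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(n_layers, ctx_size, n_kv_heads, head_dim, type_k, type_v):
    """Approximate KV cache size: one K and one V entry per layer and token."""
    elems = n_layers * ctx_size * n_kv_heads * head_dim
    return elems * (BYTES_PER_ELEM[type_k] + BYTES_PER_ELEM[type_v])


def estimate_n_cpu_moe(vram_bytes, kv_bytes, dense_bytes,
                       expert_bytes_per_layer, n_layers, vram_target=0.80):
    """Return how many layers' expert weights to keep on the CPU.

    Hypothetical model: the budget is vram_target * VRAM minus the KV cache
    and the non-expert (dense) weights; whatever is left is filled with
    whole layers of expert weights, and the rest are offloaded.
    """
    budget = vram_bytes * vram_target - kv_bytes - dense_bytes
    if budget <= 0:
        return n_layers  # nothing fits: offload every layer's experts
    layers_on_gpu = min(n_layers, int(budget // expert_bytes_per_layer))
    return n_layers - layers_on_gpu


if __name__ == "__main__":
    GiB = 2**30
    kv = kv_cache_bytes(40, 32768, 4, 128, "f16", "f16")
    print(estimate_n_cpu_moe(24 * GiB, kv, 2 * GiB, 0.5 * GiB, 40))
```

The result would be passed to llama-server as `--n-cpu-moe N`, which keeps the MoE weights of the first N layers on the CPU. A real calculation also has to leave room for compute buffers and, for vision models, the mmproj weights.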
First Run
- Launch the TUI
- Press `i` to open Images, then `b` to build the default container image (`llama-cpp-turboquant:cuda`)
- Select a model and press `d` to download it
- Press `s` to start it
Dashboard Keys
| Key | Action |
|---|---|
| `s` | Start selected model |
| `x` | Stop running model |
| `d` | Download selected model |
| `c` | Configure model (editor screen) |
| `a` | Add a new model |
| `delete` | Delete model from list |
| `f` | Clean up downloaded files |
| `i` | Image manager |
| `g` | Settings |
| `w` | Open WebUI |
| `q` | Quit |
Container Image
The default image (llama-cpp-turboquant:cuda) builds from data/images/llama-cpp-turboquant-cuda/Dockerfile. It compiles llama-server with CUDA support from the turboquant-kv-cache branch and sets ENTRYPOINT ["llama-server"]. The TUI passes all command-line arguments at runtime.
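Because the image's entrypoint is `llama-server`, everything placed after the image name in a `run` command becomes a server argument. The sketch below shows how such a command line might be assembled on the host. It is illustrative only: `build_run_command`, the `/models` mount point, and the GPU flags chosen per runtime are assumptions rather than the project's actual launch code. The llama-server flags used (`--model`, `--host`, `--port`, `--ctx-size`, `--cache-type-k`, `--cache-type-v`, `--n-cpu-moe`) are existing options.

```python
from typing import List, Optional


def build_run_command(
    runtime: str,
    image: str,
    container_name: str,
    models_dir: str,
    model_file: str,
    ctx_size: int = 262144,
    n_cpu_moe: Optional[int] = None,
    cache_type_k: str = "turbo4",
    cache_type_v: str = "turbo3",
    port: int = 8080,
    extra_args: Optional[List[str]] = None,
) -> List[str]:
    """Assemble an argv list for `docker run` / `podman run` (hypothetical)."""
    # Assumption: Docker uses --gpus, Podman uses a CDI device name.
    if runtime == "podman":
        gpu_args = ["--device", "nvidia.com/gpu=all"]
    else:
        gpu_args = ["--gpus", "all"]

    cmd = [
        runtime, "run", "--rm", "-d",
        "--name", container_name,
        *gpu_args,
        "-p", f"{port}:{port}",
        "-v", f"{models_dir}:/models:ro",  # /models is an assumed mount point
        image,
        # Everything below is passed to the ENTRYPOINT (llama-server).
        "--model", f"/models/{model_file}",
        "--host", "0.0.0.0",
        "--port", str(port),
        "--ctx-size", str(ctx_size),
        "--cache-type-k", cache_type_k,
        "--cache-type-v", cache_type_v,
    ]
    if n_cpu_moe is not None:
        cmd += ["--n-cpu-moe", str(n_cpu_moe)]
    cmd += list(extra_args or [])
    return cmd


if __name__ == "__main__":
    print(" ".join(build_run_command(
        "docker", "llama-cpp-turboquant:cuda", "example-model",
        "/path/to/models", "model.gguf", n_cpu_moe=11)))
```

Returning a list rather than a shell string lets the caller hand it straight to `subprocess.run` without quoting issues.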
Using with Coding Agents
Once a model is running on port 8080, it exposes an OpenAI-compatible API at http://localhost:8080/v1.
Claude Code
Set ANTHROPIC_BASE_URL to point at the local server:
ANTHROPIC_BASE_URL=http://127.0.0.1:8080 claude
Or add to your shell profile for persistence:
export ANTHROPIC_BASE_URL=http://127.0.0.1:8080
OpenCode
OpenCode requires a config file. Add this to ~/.config/opencode/opencode.json (global) or .opencode.json in your project:
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "llama.cpp": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama-server (local)",
      "options": {
        "baseURL": "http://127.0.0.1:8080/v1"
      },
      "models": {
        "local-model": {
          "name": "Local MoE Model",
          "limit": {
            "context": 262144,
            "output": 65536
          }
        }
      }
    }
  }
}
Then run /models in OpenCode and select the local model.
API
The server exposes an OpenAI-compatible API:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"local-model","messages":[{"role":"user","content":"Hello"}],"max_tokens":20}'
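The same request can be made from Python. The sketch below uses only the standard library and mirrors the curl call above; the helper names `build_chat_request` and `chat` are made up for this example, and the response parsing assumes the usual OpenAI-style `choices[0].message.content` layout.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8080/v1",
                       model="local-model", max_tokens=20):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Requires a model to be running on port 8080.
    print(chat("Hello"))
```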
Config
Stored at ~/.config/llama-runner/config.yaml. Each model has:
- `hf_spec`: HuggingFace repo and quant (e.g. `unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M`)
- `container_name`: unique name for the container
- `config.vision`: enable multimodal vision support
- `config.n_cpu_moe`: override auto-calculated value (None = auto)
- `config.vram_target`: target VRAM usage fraction (default 0.80)
- `config.ctx_size`: context window (default 262144)
- `config.cache_type_k` / `config.cache_type_v`: KV cache quantization (default turbo4/turbo3)
- `config.extra_args`: additional args appended to the llama-server command
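To make the fields and their defaults concrete, here is a hedged sketch of how a per-model `config` mapping could be represented once the YAML has been loaded (for example with `yaml.safe_load`). The class and function names are hypothetical, the `vision` default of `False` is an assumption, and the on-disk layout of `config.yaml` is not reproduced here.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional, Tuple


@dataclass
class ModelConfig:
    """Per-model settings with the defaults listed above."""
    vision: bool = False               # assumed default; not stated above
    n_cpu_moe: Optional[int] = None    # None = auto-calculate
    vram_target: float = 0.80
    ctx_size: int = 262144
    cache_type_k: str = "turbo4"
    cache_type_v: str = "turbo3"
    extra_args: List[str] = field(default_factory=list)


def config_from_dict(raw: dict) -> ModelConfig:
    """Build a ModelConfig from a loaded mapping, rejecting unknown keys."""
    known = {f.name for f in fields(ModelConfig)}
    unknown = set(raw) - known
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    return ModelConfig(**raw)


def split_hf_spec(hf_spec: str) -> Tuple[str, str]:
    """Split 'owner/repo:quant' into (repo, quant)."""
    repo, sep, quant = hf_spec.rpartition(":")
    if not sep or not repo or not quant:
        raise ValueError(f"expected 'repo:quant', got {hf_spec!r}")
    return repo, quant


if __name__ == "__main__":
    print(config_from_dict({"ctx_size": 32768, "n_cpu_moe": 20}))
    print(split_hf_spec("unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M"))
```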