claude-llama
Delegate token-heavy file work to a local llama.cpp model so the bulk content never enters Claude's context.
claude-llama is an MCP server that exposes three tools — llama_summarize, llama_extract, llama_ask — plus a llama_health probe. Claude calls them instead of reading large files itself; the server reads the files locally, hands them to your llama.cpp instance, and returns only the answer.
Every response carries a footer like:
---
[claude-llama] input=7,992 tok · returned=931 tok · saved≈7,061 tok · model=Qwen3.5-9B · 141s
(Real numbers from summarizing a 32 KB plan doc; see Real-world savings for the full matrix.)
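The footer is regular enough to parse if you want to check the arithmetic yourself. The sketch below is not part of claude-llama: the regular expression is inferred from the sample line above rather than from a published format, and `parseFooter` and `parseCount` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// footerRe captures the three token counts. The pattern is an assumption
// inferred from the sample footer, not a documented format.
var footerRe = regexp.MustCompile(`input=([\d,]+) tok · returned=([\d,]+) tok · saved≈([\d,]+) tok`)

// parseCount converts a comma-grouped number such as "7,992" to 7992.
func parseCount(s string) (int, error) {
	return strconv.Atoi(strings.ReplaceAll(s, ",", ""))
}

// parseFooter returns the input, returned, and saved token counts of a footer line.
func parseFooter(line string) (input, returned, saved int, err error) {
	m := footerRe.FindStringSubmatch(line)
	if m == nil {
		return 0, 0, 0, fmt.Errorf("no footer found in %q", line)
	}
	if input, err = parseCount(m[1]); err != nil {
		return 0, 0, 0, err
	}
	if returned, err = parseCount(m[2]); err != nil {
		return 0, 0, 0, err
	}
	if saved, err = parseCount(m[3]); err != nil {
		return 0, 0, 0, err
	}
	return input, returned, saved, nil
}

func main() {
	line := "[claude-llama] input=7,992 tok · returned=931 tok · saved≈7,061 tok · model=Qwen3.5-9B · 141s"
	in, out, saved, err := parseFooter(line)
	if err != nil {
		panic(err)
	}
	// saved is input minus returned: 7992 - 931 = 7061.
	fmt.Println(in-out == saved) // true
}
```

For the sample line, the saved figure is exactly the input count minus the returned count.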
The savings are also appended to a JSONL log; claude-llama-mcp stats summarizes it. CI guards the savings claim with a benchmark.
Install
One-liner (recommended):
curl -fsSL https://raw.githubusercontent.com/vxfemboy/claude-llama/main/install.sh | sh
Downloads the latest release binary for your OS/arch, verifies the checksum, drops it in ~/.local/bin, and runs claude-llama-mcp init.
As a Claude Code plugin:
/plugin marketplace add vxfemboy/claude-llama
/plugin install claude-llama:claude-llama
(then /reload-plugins)
From source:
go install github.com/vxfemboy/claude-llama/cmd/claude-llama-mcp@latest
claude-llama-mcp init
After installing, register the server with your MCP client. For Claude Code, add this to your project's .mcp.json:
{
  "mcpServers": {
    "claude-llama": { "command": "claude-llama-mcp" }
  }
}
Configuration
All settings are environment variables. claude-llama-mcp init writes them to ~/.config/claude-llama/env (honoring $XDG_CONFIG_HOME); the process env always wins over the file.
| Variable | Default | Purpose |
|---|---|---|
| LLAMA_API_URL | http://localhost:8080 | llama.cpp server (OpenAI-compatible) |
| LLAMA_MODEL | unsloth/Qwen3.5-9B-GGUF:Q4_K_M | model name passed to /v1/chat/completions |
| LLAMA_MAX_INPUT_TOKENS | 6000 | max tokens per chunk before map/reduce kicks in |
| LLAMA_TIMEOUT_SECONDS | 120 | per-call timeout |
| LLAMA_WORKSPACE_ROOT | cwd | path-traversal boundary; the server refuses to read outside it |
| LLAMA_FOOTER | true | append the per-call savings footer to each response |
| LLAMA_USAGE_LOG | true | append a JSONL row per call to $XDG_STATE_HOME/claude-llama/usage.jsonl |
To disable a boolean setting (LLAMA_FOOTER, LLAMA_USAGE_LOG), set it to 0, false, no, or off.
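A minimal sketch of that boolean rule, assuming matching is case-insensitive and ignores surrounding whitespace (the README does not say either way). `isDisabled` is a hypothetical helper, not a function exported by claude-llama.

```go
package main

import (
	"fmt"
	"strings"
)

// isDisabled reports whether raw is one of the four "off" spellings.
// Any other value, including the empty string, leaves the setting enabled.
// Lowercasing and trimming are assumptions made for this sketch.
func isDisabled(raw string) bool {
	switch strings.ToLower(strings.TrimSpace(raw)) {
	case "0", "false", "no", "off":
		return true
	}
	return false
}

func main() {
	for _, v := range []string{"0", "off", "true", ""} {
		fmt.Printf("%q disabled=%v\n", v, isDisabled(v))
	}
}
```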
Tools
llama_summarize (paths, focus?) — summarize files/dirs/globs.
llama_extract (paths, query) — pull only snippets matching query.
llama_ask (prompt, paths?) — delegate a self-contained task; paths are optional context.
llama_health () — JSON status: {ok, url, models, latency_ms, error}. Lets Claude self-diagnose before relying on the MCP for a big job.
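A client could decode the llama_health status into a struct along these lines. The field names come from the list above; the Go types (a string slice for `models`, a float for `latency_ms`, a string for `error`) and the sample payload are assumptions, so check them against real output before relying on them.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Health mirrors the {ok, url, models, latency_ms, error} status object.
// Field types are assumed for illustration.
type Health struct {
	OK        bool     `json:"ok"`
	URL       string   `json:"url"`
	Models    []string `json:"models"`
	LatencyMS float64  `json:"latency_ms"`
	Error     string   `json:"error"`
}

// decodeHealth parses a llama_health JSON payload.
func decodeHealth(payload []byte) (Health, error) {
	var h Health
	err := json.Unmarshal(payload, &h)
	return h, err
}

func main() {
	// Hypothetical payload, not captured from a real server.
	sample := []byte(`{"ok":true,"url":"http://localhost:8080","models":["example-model"],"latency_ms":12,"error":""}`)
	h, err := decodeHealth(sample)
	if err != nil {
		panic(err)
	}
	if !h.OK {
		fmt.Println("llama.cpp unavailable:", h.Error)
		return
	}
	fmt.Printf("ready at %s with %d model(s)\n", h.URL, len(h.Models))
}
```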
Real-world savings
In the wild
Two llama_summarize calls during a single cross-project session
(separate Rust repo, same Qwen3.5-9B Q8 model on hack-mini:8080):
| Call | Input tok | Returned tok | Saved | Duration |
|---|---|---|---|---|
| src/ + README.md + Cargo.toml | 34,247 | 535 | 33,712 | 10m48s |
| config + docker + scripts/ + tests/ | 3,914 | 528 | 3,386 | 1m51s |
| Total | 38,161 | 1,063 | 37,098 | 12m39s |
About 97% of the bulk file content (37,098 of 38,161 tokens) stayed out of
Claude's context, at a cost of roughly 13 minutes of local inference.
Figures are from claude-llama-mcp stats --json.
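The percentage follows directly from the two calls' input and returned token counts. A short sketch of the arithmetic (the `call` type and `savedPercent` are hypothetical names used only here):

```go
package main

import "fmt"

// call holds the input and returned token counts for one tool call.
type call struct {
	input, returned int
}

// savedPercent returns the share of input tokens that never reached
// Claude's context, as a percentage across all calls.
func savedPercent(calls []call) float64 {
	var in, out int
	for _, c := range calls {
		in += c.input
		out += c.returned
	}
	if in == 0 {
		return 0
	}
	return 100 * float64(in-out) / float64(in)
}

func main() {
	calls := []call{
		{input: 34247, returned: 535}, // src/ + README.md + Cargo.toml
		{input: 3914, returned: 528},  // config + docker + scripts/ + tests/
	}
	fmt.Printf("%.1f%%\n", savedPercent(calls)) // 97.2%
}
```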
Benchmark matrix
Measured against this repo's own files (Qwen3.5-9B Q8, local hardware —
your mileage will vary with model + GPU):