claude-llama

Delegate token-heavy file work to a local llama.cpp model so the bulk content never enters Claude's context.

claude-llama is an MCP server that exposes three tools — llama_summarize, llama_extract, llama_ask — plus a llama_health probe. Claude calls them instead of reading large files itself; the server reads the files locally, hands them to your llama.cpp instance, and returns only the answer.

Every response carries a footer like:

---
[claude-llama] input=7,992 tok · returned=931 tok · saved≈7,061 tok · model=Qwen3.5-9B · 141s

(real numbers from summarizing a 32KB plan doc — see Real-world savings for the full matrix.)
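The footer figures are consistent with saved ≈ input − returned. As a minimal sketch, assuming the footer always follows the layout of the example above, a script could parse a footer line and recompute the savings. The regex and the `parse_footer` helper are hypothetical and are not part of claude-llama:

```python
import re

# Assumption: the footer layout matches the single example shown in this README.
FOOTER_RE = re.compile(
    r"input=([\d,]+) tok · returned=([\d,]+) tok · saved≈([\d,]+) tok"
)


def parse_footer(line: str) -> dict:
    """Extract token counts from a footer line and recompute the savings."""
    match = FOOTER_RE.search(line)
    if match is None:
        raise ValueError("not a claude-llama footer line")
    # Strip thousands separators before converting to integers.
    input_tok, returned_tok, saved_tok = (
        int(group.replace(",", "")) for group in match.groups()
    )
    return {
        "input": input_tok,
        "returned": returned_tok,
        "saved": saved_tok,
        "recomputed": input_tok - returned_tok,
    }


footer = (
    "[claude-llama] input=7,992 tok · returned=931 tok · "
    "saved≈7,061 tok · model=Qwen3.5-9B · 141s"
)
counts = parse_footer(footer)
print(counts)
```

For the example footer, the recomputed value (7,992 − 931) matches the reported 7,061 tokens saved.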

The savings are also appended to a JSONL log; claude-llama-mcp stats summarizes it. CI guards the savings claim with a benchmark.

Install

One-liner (recommended):

curl -fsSL https://raw.githubusercontent.com/vxfemboy/claude-llama/main/install.sh | sh

The script downloads the latest release binary for your OS/arch, verifies the checksum, drops it in ~/.local/bin, and runs claude-llama-mcp init.

As a Claude Code plugin:

/plugin marketplace add vxfemboy/claude-llama
/plugin install claude-llama:claude-llama

(then /reload-plugins)

From source:

go install github.com/vxfemboy/claude-llama/cmd/claude-llama-mcp@latest
claude-llama-mcp init

After installing, register the server with your MCP client. For Claude Code, add this to your project's .mcp.json:

{
  "mcpServers": {
    "claude-llama": { "command": "claude-llama-mcp" }
  }
}

Configuration

All settings are environment variables. claude-llama-mcp init writes them to ~/.config/claude-llama/env (honoring $XDG_CONFIG_HOME); the process env always wins over the file.
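The precedence rule above can be sketched as follows. This is an illustration only: it assumes the env file holds plain `KEY=VALUE` lines, which this README does not specify, and the helper names (`parse_env_file`, `env_file_path`, `effective_settings`) are hypothetical rather than taken from the server's code.

```python
from pathlib import Path


def parse_env_file(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blanks and # comments (assumed format)."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def env_file_path(environ: dict) -> Path:
    """Resolve the env file location, honoring $XDG_CONFIG_HOME when set."""
    base = environ.get("XDG_CONFIG_HOME") or str(Path.home() / ".config")
    return Path(base) / "claude-llama" / "env"


def effective_settings(file_text: str, environ: dict) -> dict:
    """Merge file values with the process environment; the process wins."""
    merged = parse_env_file(file_text)
    merged.update({k: v for k, v in environ.items() if k.startswith("LLAMA_")})
    return merged


file_text = "LLAMA_API_URL=http://localhost:8080\nLLAMA_TIMEOUT_SECONDS=120\n"
process_env = {"LLAMA_TIMEOUT_SECONDS": "300"}
settings = effective_settings(file_text, process_env)
print(settings)
```

Here `LLAMA_TIMEOUT_SECONDS` ends up as `300` because the process environment overrides the file, while `LLAMA_API_URL` keeps the file's value.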

| Variable | Default | Purpose |
| --- | --- | --- |
| `LLAMA_API_URL` | `http://localhost:8080` | llama.cpp server (OpenAI-compatible) |
| `LLAMA_MODEL` | `unsloth/Qwen3.5-9B-GGUF:Q4_K_M` | model name passed to `/v1/chat/completions` |
| `LLAMA_MAX_INPUT_TOKENS` | `6000` | max tokens per chunk before map/reduce kicks in |
| `LLAMA_TIMEOUT_SECONDS` | `120` | per-call timeout |
| `LLAMA_WORKSPACE_ROOT` | cwd | path-traversal boundary; the server refuses to read outside it |
| `LLAMA_FOOTER` | `true` | append the per-call savings footer to each response |
| `LLAMA_USAGE_LOG` | `true` | append a JSONL row per call to `$XDG_STATE_HOME/claude-llama/usage.jsonl` |
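`LLAMA_WORKSPACE_ROOT` is described as a path-traversal boundary. One common way to enforce such a boundary is to resolve both paths and check containment, as in the sketch below. This shows the general technique, not the server's actual check, and `is_inside_workspace` is a hypothetical name.

```python
from pathlib import Path


def is_inside_workspace(root: str, candidate: str) -> bool:
    """Return True when candidate resolves to a path at or below root."""
    root_path = Path(root).resolve()
    # Relative candidates are interpreted against the workspace root;
    # an absolute candidate replaces the root when joined.
    target = (root_path / candidate).resolve()
    return target == root_path or root_path in target.parents


print(is_inside_workspace("/srv/project", "src/main.rs"))
print(is_inside_workspace("/srv/project", "../secrets.txt"))
print(is_inside_workspace("/srv/project", "/etc/passwd"))
```

Resolving before comparing matters: a plain string-prefix test would accept `../` segments that escape the root.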

To disable a boolean setting (LLAMA_FOOTER, LLAMA_USAGE_LOG), set it to 0, false, no, or off.
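A minimal sketch of the boolean rule above. Two details are assumptions that this README does not state: matching is case-insensitive, and any value outside the four listed ones counts as enabled. `env_bool` is a hypothetical helper.

```python
# The four disabling values listed in this README.
FALSY = {"0", "false", "no", "off"}


def env_bool(value, default=True):
    """Interpret a boolean setting; unset values fall back to the default."""
    if value is None:
        return default
    # Assumption: surrounding whitespace and letter case are ignored.
    return value.strip().lower() not in FALSY


print(env_bool("off"))
print(env_bool("true"))
print(env_bool(None))
```

With the documented defaults of `true`, an unset `LLAMA_FOOTER` or `LLAMA_USAGE_LOG` stays enabled.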

Tools

llama_summarize (paths, focus?) — summarize files/dirs/globs.
llama_extract (paths, query) — pull only snippets matching query.
llama_ask (prompt, paths?) — delegate a self-contained task; paths are optional context.
llama_health () — JSON status: {ok, url, models, latency_ms, error}. Lets Claude self-diagnose before relying on the MCP for a big job.
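As a rough illustration of the self-diagnosis step, the sketch below decides whether to go ahead with a large job based on a llama_health result. The field names come from the status shape listed above. The sample payload, the field types (a list for `models`, a number for `latency_ms`), and the latency threshold are assumptions made for this example.

```python
import json


def ready_for_big_job(health_json: str, max_latency_ms: int = 2000):
    """Return (ready, reason) for a llama_health JSON status string."""
    status = json.loads(health_json)
    if not status.get("ok"):
        return False, f"server unhealthy: {status.get('error') or 'unknown error'}"
    if not status.get("models"):
        return False, "no models reported"
    latency = status.get("latency_ms", 0)
    if latency > max_latency_ms:
        return False, f"latency {latency} ms exceeds {max_latency_ms} ms"
    return True, "ok"


# Hypothetical payloads, not captured from a real server.
healthy = json.dumps({
    "ok": True,
    "url": "http://localhost:8080",
    "models": ["unsloth/Qwen3.5-9B-GGUF:Q4_K_M"],
    "latency_ms": 42,
    "error": None,
})
down = json.dumps({
    "ok": False,
    "url": "http://localhost:8080",
    "models": [],
    "latency_ms": 0,
    "error": "connection refused",
})

print(ready_for_big_job(healthy))
print(ready_for_big_job(down))
```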

Real-world savings

In the wild

Two llama_summarize calls during a single cross-project session (separate Rust repo, same Qwen3.5-9B Q8 model on hack-mini:8080):

| Call | Input tok | Returned tok | Saved | Duration |
| --- | --- | --- | --- | --- |
| `src/` + `README.md` + `Cargo.toml` | 34,247 | 535 | 33,712 | 10m48s |
| config + docker + `scripts/` + `tests/` | 3,914 | 528 | 3,386 | 1m51s |
| Total | 38,161 | 1,063 | 37,098 | 12m39s |

About 97% of the bulk file content stayed out of Claude's context, at a cost of roughly 13 minutes of local inference. The numbers were pulled from claude-llama-mcp stats --json.
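The ~97% figure follows from the table's totals. A short check, using only the numbers reported above (the variable names are illustrative):

```python
# (label, input tokens, returned tokens) copied from the table above.
calls = [
    ("src/ + README.md + Cargo.toml", 34_247, 535),
    ("config + docker + scripts/ + tests/", 3_914, 528),
]

total_input = sum(input_tok for _, input_tok, _ in calls)
total_returned = sum(returned_tok for _, _, returned_tok in calls)
total_saved = total_input - total_returned
ratio = total_saved / total_input

print(f"input={total_input:,} returned={total_returned:,} saved={total_saved:,}")
print(f"share kept out of context: {ratio:.1%}")
```

The totals reproduce the table's last row (38,161 in, 1,063 returned, 37,098 saved), and the ratio comes out at about 97.2%.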

Benchmark matrix

Measured against this repo's own files (Qwen3.5-9B Q8, local hardware — your mileage will vary with model + GPU):
