Skill

tune-ik-llama

Use when setting up, tuning, or debugging slow inference for a GGUF model with ik_llama.cpp on a VRAM-constrained NVIDIA GPU. Covers picking --n-cpu-moe, deciding on -rtr, fitting models bigger than VRAM, escaping auto-spill, and pushing context to native max.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/tune-ik-llama:tune-ik-llama

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

A working, fast `llama-server` config is the result of a measurement loop, not a guess. **Iron rule:** never hand the user a final command without measuring at least once. **Speed is the diagnostic signal, not VRAM headroom.**

Supporting Files

references/auto-spill.mdreferences/empirical-data.mdreferences/output-formats.mdreferences/quant-guide.md

SKILL.md

71 lines · ~1.3k tokens

Stats

Parent stars0

MaintenanceGood

Last CommitMay 9, 2026

Actions

View Source View Plugin View on GitHub View README

Stats

Actions

Tune ik_llama.cpp

A working, fast llama-server config is the result of a measurement loop, not a guess. Iron rule: never hand the user a final command without measuring at least once. Speed is the diagnostic signal, not VRAM headroom.

Scope: the --n-cpu-moe / -rtr / auto-spill machinery this skill teaches applies to MoE models with routed experts — tested on Qwen 3.6 35B-A3B and Gemma 4 26B-A4B; same procedure should work for Mixtral, DeepSeek-V2-Lite, Phi-3.5-MoE in principle but isn't measured. For dense models (Llama, Gemma 3, Mistral non-MoE, Qwen-dense), --n-cpu-moe and -rtr are no-ops or actively harmful — Phase 5 collapses to "use -ngl 99 if it fits, otherwise drop -ngl until it does". The Phase 1-4 measurement loop and the Phase 6 ctx push still apply unchanged. If you don't know whether a model is MoE, check the GGUF metadata for expert_count / expert_used_count or look at the model name (A3B, A4B, 8x7B, MoE are all signals).

Phase 0 — Requirements (abort if missing)

ik_llama.cpp build (ikawrakow/ik_llama.cpp fork — -rtr and IQ_K_R4 aren't in stock llama.cpp). Verify with llama-server --version or take user's word.
NVIDIA GPU with nvidia-smi. AMD/Apple: stop.
A GGUF model file path.

Phase 1 — Pre-flight

Run nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv. Record model size in GB (add mmproj size for vision). If user is choosing a quant, recommend imatrix-IQ_K — see references/quant-guide.md.

Phase 2 — Predict regime

Model + KV vs free VRAM	Regime	Starting flags
< ~85%	fits	`--n-cpu-moe 0`, no `-rtr`
> ~85%	spill-risk	`--n-cpu-moe = layer_count / 4`, add `-rtr`
≫ 100%	won't fit	reduce `-c`, raise `--n-cpu-moe`

Common base flags: -ngl 99 -fa 1 -ctk q4_0 -ctv q4_0 --jinja --host 0.0.0.0 --port 8080. Add --mmproj <path> for vision. Add --reasoning-budget -1 for Qwen3 reasoning.

Phase 3 — Baseline measurement

Start llama-server. Send a 5-token chat completion to /v1/chat/completions. Capture two numbers:

predicted_per_second from the response timings block.
nvidia-smi used + free VRAM.

Phase 4 — Diagnose by speed (CRITICAL)

Auto-spill is consistently 3-5× slower than the fast regime on the same card. Use this relative rule, not absolute numbers — speed depends on the card's memory bandwidth (not VRAM size: a 3070 Ti 8 GB at 608 GB/s outpaces a 4060 Ti 16 GB at 288 GB/s) and on whether the card has native FP4/INT4 hardware (RTX 50-series).

Predict the card's fast-regime ceiling roughly as (memory_bandwidth_GB/s) ÷ (active_params_GB_per_token) — typically 0.5-2 GB/token at IQ3/IQ4 for 3-4B-active MoE. See references/empirical-data.md for measured numbers and per-card examples.

After baseline measurement:

Speed near or above predicted ceiling: fast regime. Skip to Phase 6.
Speed 3-5× lower than ceiling: auto-spilling. Add -rtr if absent and sweep --n-cpu-moe upward in steps of 2 until speed jumps sharply (the threshold jump is unmistakable). See references/auto-spill.md.
OOM at startup: raise --n-cpu-moe further or lower -c.

Phase 5 — Sweep `--n-cpu-moe`

Target headroom is ≥1 GB free VRAM (~1024 MiB). 800 MiB is the absolute floor; below that, vision encode spikes, long-prompt prefill buffers, or background apps will push you to OOM.

Fits: lower --n-cpu-moe toward 0 in steps of 1 until VRAM headroom approaches 1 GB. Each step ≈ +1-2 tok/s but eats ~250 MB.
Spill-risk: find the threshold jump, then raise just enough to reach ≥1 GB headroom.

Phase 6 — Push ctx and lock

Raise -c toward the model's native max (Qwen3.x 256k, Gemma 4 26B-A4B 256k as measured, Gemma 3 / Llama 3.x 128k, Mistral 7B v0.1 32k). Unsure? Check n_ctx_train in llama-server's startup log or ask the user — don't default to 256k. Idle-prompt speed is flat across ctx at q4_0 KV. Stress-test with a 5k-token prompt; lock the highest ctx with ≥1 GB free.

Hand the user the output per references/output-formats.md (bare llama-server command by default; llama-swap config on request).

Red flags — do not do these

Deliver a final command without measuring. If the user says "skip the loop, just give me a command", explain that one measurement takes ~30s and prevents handing them a config 3× slower than necessary. Then measure.
Add -rtr "just to be safe" — net-negative on models that fit cleanly.
Add -fmoe to a dense (non-MoE) model.
Trust the on-disk _R4 filename suffix as a substitute for runtime -rtr.
Accept ~25 tok/s on a 35B-A3B as "this hardware's ceiling". That's auto-spill.
Set -c 32768 because "longer is unsafe". With q4_0 KV, idle speed is flat — push to native max.

tune-ik-llama

Invocation

Context Preview

Supporting Files

SKILL.md

tune-ik-llama

Invocation

Context Preview

Supporting Files

SKILL.md

Tune ik_llama.cpp

Phase 0 — Requirements (abort if missing)

Phase 1 — Pre-flight

Phase 2 — Predict regime

Phase 3 — Baseline measurement

Phase 4 — Diagnose by speed (CRITICAL)

Phase 5 — Sweep `--n-cpu-moe`

Phase 6 — Push ctx and lock

Red flags — do not do these

Similar Skills

Tune ik_llama.cpp

Phase 0 — Requirements (abort if missing)

Phase 1 — Pre-flight

Phase 2 — Predict regime

Phase 3 — Baseline measurement

Phase 4 — Diagnose by speed (CRITICAL)

Phase 5 — Sweep `--n-cpu-moe`

Phase 6 — Push ctx and lock

Red flags — do not do these

Similar Skills

tune-ik-llama

Invocation

Context Preview

Supporting Files

SKILL.md

tune-ik-llama

Invocation

Context Preview

Supporting Files

SKILL.md

Tune ik_llama.cpp

Phase 0 — Requirements (abort if missing)

Phase 1 — Pre-flight

Phase 2 — Predict regime

Phase 3 — Baseline measurement

Phase 4 — Diagnose by speed (CRITICAL)

Phase 5 — Sweep --n-cpu-moe

Phase 6 — Push ctx and lock

Red flags — do not do these

Similar Skills

Tune ik_llama.cpp

Phase 0 — Requirements (abort if missing)

Phase 1 — Pre-flight

Phase 2 — Predict regime

Phase 3 — Baseline measurement

Phase 4 — Diagnose by speed (CRITICAL)

Phase 5 — Sweep --n-cpu-moe

Phase 6 — Push ctx and lock

Red flags — do not do these

Similar Skills

Phase 5 — Sweep `--n-cpu-moe`

Phase 5 — Sweep `--n-cpu-moe`