From tune-ik-llama
Use when setting up, tuning, or debugging slow inference for a GGUF model with ik_llama.cpp on a VRAM-constrained NVIDIA GPU. Covers picking --n-cpu-moe, deciding on -rtr, fitting models bigger than VRAM, escaping auto-spill, and pushing context to native max.
How this skill is triggered — by the user, by Claude, or both
Slash command
/tune-ik-llama:tune-ik-llamaThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
A working, fast `llama-server` config is the result of a measurement loop, not a guess. **Iron rule:** never hand the user a final command without measuring at least once. **Speed is the diagnostic signal, not VRAM headroom.**
A working, fast llama-server config is the result of a measurement loop, not a guess. Iron rule: never hand the user a final command without measuring at least once. Speed is the diagnostic signal, not VRAM headroom.
Scope: the --n-cpu-moe / -rtr / auto-spill machinery this skill teaches applies to MoE models with routed experts — tested on Qwen 3.6 35B-A3B and Gemma 4 26B-A4B; same procedure should work for Mixtral, DeepSeek-V2-Lite, Phi-3.5-MoE in principle but isn't measured. For dense models (Llama, Gemma 3, Mistral non-MoE, Qwen-dense), --n-cpu-moe and -rtr are no-ops or actively harmful — Phase 5 collapses to "use -ngl 99 if it fits, otherwise drop -ngl until it does". The Phase 1-4 measurement loop and the Phase 6 ctx push still apply unchanged. If you don't know whether a model is MoE, check the GGUF metadata for expert_count / expert_used_count or look at the model name (A3B, A4B, 8x7B, MoE are all signals).
ikawrakow/ik_llama.cpp fork — -rtr and IQ_K_R4 aren't in stock llama.cpp). Verify with llama-server --version or take user's word.nvidia-smi. AMD/Apple: stop.Run nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv. Record model size in GB (add mmproj size for vision). If user is choosing a quant, recommend imatrix-IQ_K — see references/quant-guide.md.
| Model + KV vs free VRAM | Regime | Starting flags |
|---|---|---|
| < ~85% | fits | --n-cpu-moe 0, no -rtr |
| > ~85% | spill-risk | --n-cpu-moe = layer_count / 4, add -rtr |
| ≫ 100% | won't fit | reduce -c, raise --n-cpu-moe |
Common base flags: -ngl 99 -fa 1 -ctk q4_0 -ctv q4_0 --jinja --host 0.0.0.0 --port 8080. Add --mmproj <path> for vision. Add --reasoning-budget -1 for Qwen3 reasoning.
Start llama-server. Send a 5-token chat completion to /v1/chat/completions. Capture two numbers:
predicted_per_second from the response timings block.nvidia-smi used + free VRAM.Auto-spill is consistently 3-5× slower than the fast regime on the same card. Use this relative rule, not absolute numbers — speed depends on the card's memory bandwidth (not VRAM size: a 3070 Ti 8 GB at 608 GB/s outpaces a 4060 Ti 16 GB at 288 GB/s) and on whether the card has native FP4/INT4 hardware (RTX 50-series).
Predict the card's fast-regime ceiling roughly as (memory_bandwidth_GB/s) ÷ (active_params_GB_per_token) — typically 0.5-2 GB/token at IQ3/IQ4 for 3-4B-active MoE. See references/empirical-data.md for measured numbers and per-card examples.
After baseline measurement:
-rtr if absent and sweep --n-cpu-moe upward in steps of 2 until speed jumps sharply (the threshold jump is unmistakable). See references/auto-spill.md.--n-cpu-moe further or lower -c.--n-cpu-moeTarget headroom is ≥1 GB free VRAM (~1024 MiB). 800 MiB is the absolute floor; below that, vision encode spikes, long-prompt prefill buffers, or background apps will push you to OOM.
--n-cpu-moe toward 0 in steps of 1 until VRAM headroom approaches 1 GB. Each step ≈ +1-2 tok/s but eats ~250 MB.Raise -c toward the model's native max (Qwen3.x 256k, Gemma 4 26B-A4B 256k as measured, Gemma 3 / Llama 3.x 128k, Mistral 7B v0.1 32k). Unsure? Check n_ctx_train in llama-server's startup log or ask the user — don't default to 256k. Idle-prompt speed is flat across ctx at q4_0 KV. Stress-test with a 5k-token prompt; lock the highest ctx with ≥1 GB free.
Hand the user the output per references/output-formats.md (bare llama-server command by default; llama-swap config on request).
-rtr "just to be safe" — net-negative on models that fit cleanly.-fmoe to a dense (non-MoE) model._R4 filename suffix as a substitute for runtime -rtr.-c 32768 because "longer is unsafe". With q4_0 KV, idle speed is flat — push to native max.Creates, edits, and optimizes skills for Claude Code, including drafting, evaluating with test prompts, iterating on performance, and improving skill descriptions for better triggering accuracy.
npx claudepluginhub pmaeria/tune-ik-llama --plugin tune-ik-llama