tune-ik-llama
A Claude Code / Codex plugin that gives your agent a procedure for tuning ik_llama.cpp on a VRAM-constrained NVIDIA GPU.
When you ask an agent to "set up this GGUF model on my GPU as fast as possible", the agent loads this skill and walks through a concrete six-phase workflow: detect hardware, predict whether the model will fit, measure baseline speed, diagnose the regime, sweep --n-cpu-moe, and push context to the model's native max — instead of guessing flags.
What this is for
MoE models on ik_llama.cpp. This is the new class of highly efficient, consumer-hardware-friendly LLMs — big total parameter counts but tiny active parameter counts per token, so they punch way above their weight on a 12–16 GB GPU. The two models the skill was tested against:
- Qwen 3.6 35B-A3B (3B active out of 35B total): most users will be here. Measured on a 16 GB 4060 Ti: ~95 tok/s for IQ3_K_R4, ~75 tok/s for IQ4_K_R4. Vision variant (Qwen3.6-VL) also works.
- Gemma 4 26B-A4B (4B active out of 26B total): ~80 tok/s for IQ4_XS on the same card.
Other routed-expert architectures (Mixtral, DeepSeek-V2-Lite, Phi-3.5-MoE, etc.) should work in principle — same flags, same six-phase procedure — but aren't measured. Numbers in references/empirical-data.md are only for the two above.
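As a quick sanity check on the "tiny active parameter count" claim, here is the active share per token for the two tested models. This is plain integer arithmetic on the figures quoted above; the helper name active_percent is made up for illustration.

```shell
# active_percent ACTIVE_B TOTAL_B
# Prints the share of parameters read per generated token, rounded to a
# whole percent. Hypothetical helper; inputs are billions of parameters.
active_percent() {
  echo "$(( (1000 * $1 / $2 + 5) / 10 ))%"
}

active_percent 3 35   # Qwen 3.6 35B-A3B
active_percent 4 26   # Gemma 4 26B-A4B
```

Both models read well under a fifth of their weights per token, which is why splitting experts between GPU and CPU can pay off.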
Why MoE? Because generation only reads the active experts per token (3–4 B out of 26–35 B+ total), so you can fit a large, high-quality model on a small GPU if you split which experts live on GPU vs CPU correctly. ik_llama.cpp is the runtime that makes that fast. It adds three MoE-specific features that stock llama.cpp doesn't have:
- --n-cpu-moe N: explicit per-layer expert offload to CPU
- -rtr (run-time repack): repacks tensors into the row-interleaved R4 layout at load time, so CPU-resident experts run at SIMD speed
- -fmoe: fused MoE kernel, sizable speedup for the per-token expert routing
The skill walks an agent through tuning all three on the user's specific hardware.
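To make the three flags concrete, here is a minimal sketch of how they might appear together on a llama-server command line. It only prints the command; nothing is launched. The model path and the --n-cpu-moe value are placeholders, and build_server_cmd is a hypothetical helper, not part of the skill.

```shell
# build_server_cmd MODEL_PATH N_CPU_MOE
# Prints a llama-server invocation using the three MoE flags described above.
# -rtr is shown for illustration; drop it when the model already fits on the GPU.
build_server_cmd() {
  printf 'llama-server -m %s -ngl 99 --n-cpu-moe %s -rtr -fmoe\n' "$1" "$2"
}

# Placeholder values: replace the path with your GGUF and tune the count.
build_server_cmd /path/to/model.gguf 8
```

The right --n-cpu-moe value depends on the GPU and the quant, which is what the sweep phase of the workflow is for.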
Not for dense models. Llama 3.x, Gemma 3 (the dense one), Mistral non-MoE, Qwen-dense — the MoE knobs are no-ops or actively harmful. Use stock llama.cpp flags (-ngl 99 if it fits on the GPU, otherwise reduce -ngl until it does). This skill doesn't help you there.
Why this exists
ik_llama.cpp has two interacting features that are easy to get wrong:
- Auto-spill: when a model doesn't quite fit on the GPU, the runtime silently pushes more MoE layers to CPU than you asked for. This drops generation throughput from ~80 tok/s to ~25 tok/s on 35B-A3B-class MoE models. Speed is the only symptom; VRAM usage looks normal.
- -rtr (run-time repack): repacks tensors to a SIMD-friendly layout. Required to escape auto-spill. Net-negative on models that already fit. The on-disk _R4 filename suffix does not substitute for -rtr.
A typical agent without this skill will pick generic flags, accept the slow regime, and never push context past the user's stated minimum. With this skill loaded, the agent diagnoses by speed, applies -rtr only when needed, sweeps --n-cpu-moe correctly, and lands on a working command.
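"Diagnose by speed" can be sketched as a simple comparison between measured and expected generation throughput. The cutoff used here (less than half of the expected speed) is an assumption chosen for illustration, not a threshold taken from the skill, and diagnose_regime is a hypothetical name.

```shell
# diagnose_regime MEASURED_TOKS EXPECTED_TOKS
# Both arguments are whole-number tok/s. Assumption: falling below half of
# the expected speed is treated as the auto-spill symptom.
diagnose_regime() {
  if [ $(( $1 * 2 )) -lt "$2" ]; then
    echo "auto-spill-suspected"
  else
    echo "ok"
  fi
}

diagnose_regime 25 80   # the slow regime described above
diagnose_regime 75 80   # close to the expected speed
```

A result of auto-spill-suspected is the cue to retry with -rtr; an ok result suggests leaving -rtr off.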
Getting started
Four steps from "freshly built PC" to "running a 35B model on your GPU at 80+ tok/s".
1. Compile ik_llama.cpp
There are no pre-built binaries — you compile from source. Rough flow on Windows + CUDA:
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j
You'll need CMake, a C++ compiler (Visual Studio Build Tools on Windows; gcc/clang on Linux), and the CUDA toolkit. Full build instructions for all platforms are in the upstream README.
The binary lands at build/bin/Release/llama-server.exe (Windows) or build/bin/llama-server (Linux/macOS).
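A small check along these lines can confirm where the build put the server binary. It only looks at the two paths named above; find_llama_server is a hypothetical helper and the checkout directory is whatever you cloned into.

```shell
# find_llama_server CHECKOUT_DIR
# Prints the first llama-server binary found at the two documented locations,
# or returns non-zero with a message on stderr if neither exists.
find_llama_server() {
  for candidate in \
    "$1/build/bin/Release/llama-server.exe" \
    "$1/build/bin/llama-server"; do
    if [ -f "$candidate" ]; then
      echo "$candidate"
      return 0
    fi
  done
  echo "llama-server not found under $1/build/bin" >&2
  return 1
}
```

For example, find_llama_server ./ik_llama.cpp prints the path to use in later steps, assuming the build completed.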
2. Download a starter model
For a 12–16 GB NVIDIA GPU, start with Qwen 3.6 35B-A3B — a 35B-parameter MoE with only 3B active per token, so generation is fast despite the size. Two imatrix-calibrated quants prepared with the row-interleaved R4 layout:
| Quant | Size | Best for |
|---|---|---|
| Qwen3.6-35B-A3B-IQ3_K_R4 | ~14 GB | Speed-first. Fits cleanly on 12 GB+ cards. ~95 tok/s measured on a 16 GB 4060 Ti. |
| Qwen3.6-35B-A3B-IQ4_K_R4 | ~18.5 GB | Quality-first. Will auto-spill on 12–16 GB cards; the skill handles it. ~75 tok/s measured on the same card. |
Smaller GPU? Step down to a smaller model (Gemma 3 12B Q4_K_M, Mistral 7B). Larger GPU (24 GB+)? IQ4_K_R4 fits cleanly and goes faster.
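The guidance above can be turned into a rough starting-point picker based on total VRAM. This is a sketch under assumptions: pick_quant is a hypothetical helper, it defaults to the speed-first quant for anything between 12 and 24 GB (cards between 16 and 24 GB are not covered by the measurements here), and the quality-first IQ4_K_R4 remains an option on those cards if you accept auto-spill.

```shell
# pick_quant VRAM_MIB
# VRAM_MIB is total GPU memory in MiB, as nvidia-smi reports it.
# Rounds to the nearest GiB because cards often report slightly under
# their nominal size (for example 16380 MiB on a 16 GB card).
pick_quant() {
  vram_gib=$(( ($1 + 512) / 1024 ))
  if [ "$vram_gib" -ge 24 ]; then
    echo "Qwen3.6-35B-A3B-IQ4_K_R4"
  elif [ "$vram_gib" -ge 12 ]; then
    echo "Qwen3.6-35B-A3B-IQ3_K_R4"
  else
    echo "smaller-model"
  fi
}

# Only query the GPU when nvidia-smi is available; uses the first GPU listed.
if command -v nvidia-smi >/dev/null 2>&1; then
  vram_mib="$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n 1)"
  pick_quant "$vram_mib"
fi
```

Treat the result as a first guess to download, not a verdict; the skill's fit prediction and baseline measurement are what decide the final choice.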
3. Save it somewhere
Anywhere — the skill makes no assumptions about folder layout. Just remember the path: