From skillry-optional-specialist
Use when you need to review local LLM runtimes, model routing, resource limits, privacy boundaries, and offline behavior.
Slash command
/skillry-optional-specialist:79-local-llm-runtime-review
Review a local LLM runtime setup — Ollama, llama.cpp, LM Studio, vLLM, or similar — for model selection, quantization correctness, VRAM/RAM fit, context length configuration, throughput characteristics, model routing logic, and privacy boundary enforcement. Produces concrete configuration recommendations with expected performance impact — not generic "optimize your hardware" advice.
Inventory the hardware. Record: GPU model and total VRAM, RAM capacity and speed (DDR4 vs DDR5, frequency), CPU model and core count, storage type and speed (NVMe PCIe gen 4 vs gen 3 vs SATA SSD). Calculate available VRAM after OS, display driver, and any other running GPU processes: available_vram_GB = total_vram_GB - os_reserved_GB, where os_reserved_GB covers everything already resident on the GPU. All model loading and routing decisions must fit within this constraint.
Audit the runtime configuration. For each runtime in scope:
Ollama: check OLLAMA_MAX_LOADED_MODELS (the default depends on the Ollama version, so confirm it against the installed release; raise it only if VRAM supports multiple concurrent models), OLLAMA_NUM_PARALLEL (concurrent request handling), OLLAMA_MAX_VRAM (VRAM allocation cap), OLLAMA_KEEP_ALIVE (model unload timeout; set to 0 if VRAM is shared with other processes). Review the Modelfile for each deployed model.
llama.cpp / llama-server: check -ngl (number of GPU layers; must be at least the model's total layer count for full GPU execution), -c (context size in tokens), -t (CPU thread count; set to the physical core count, not the hyperthreaded logical count), --mlock (lock model in RAM to prevent swap), --no-mmap (force full RAM load vs memory-mapped I/O).
LM Studio: check GPU offload percentage, context length, batch size, CPU threads allocated.
Document current values and compare against the hardware inventory to identify misconfigurations.
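As a minimal sketch of the Ollama part of this audit, the settings named above can be read from the environment and written into the review notes. The function name and the report shape are illustrative assumptions; only the variable names come from the list above. The values must be read from the environment of the Ollama server process (for example its service unit), which may differ from an interactive shell.

```python
import os
from typing import Mapping, Optional

# Environment variables named in the audit step above.
OLLAMA_SETTINGS = (
    "OLLAMA_MAX_LOADED_MODELS",
    "OLLAMA_NUM_PARALLEL",
    "OLLAMA_MAX_VRAM",
    "OLLAMA_KEEP_ALIVE",
)


def collect_ollama_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return each audited setting, or None when it is not set explicitly."""
    env = os.environ if env is None else env
    return {name: env.get(name) for name in OLLAMA_SETTINGS}


if __name__ == "__main__":
    for name, value in collect_ollama_settings().items():
        shown = value if value is not None else "(unset, runtime default applies)"
        print(f"{name} = {shown}")
```

An unset variable is not a finding by itself; it only means the runtime default applies and should be looked up for the installed version.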
model_vram_GB = (parameter_count_B × bits_per_weight / 8) + kv_cache_GB
KV cache size:
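kv_cache_GB = 2 × n_layers × n_kv_heads × head_dim × context_length × batch_size × bytes_per_element / 1e9
Here n_kv_heads is the number of key/value heads, which is smaller than the attention head count on grouped-query-attention models such as Llama 3 8B (32 layers, 8 KV heads, head_dim 128). The leading 2 covers keys and values.
Example: Llama 3 8B at Q4_K_M (about 4.5 bits per weight) ≈ 4.5 GB of model weights + 0.5 GB of FP16 KV cache at 4K context ≈ 5 GB total. A GPU with 8 GB of VRAM can load this fully, leaving roughly 3 GB for KV cache growth before OS and driver reservations are subtracted.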
If the model does not fully fit in VRAM: document how many layers will be CPU-offloaded and the expected throughput penalty (CPU offload typically reduces generation speed by 5-20x depending on PCIe bandwidth).
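The VRAM arithmetic above can be sketched in a few lines of Python. The function names are illustrative, the Llama 3 8B shape (32 layers, 8 key/value heads, head dimension 128) and the FP16 cache are assumptions about the example model, and the 1 GB reservation is a placeholder to be replaced with the measured value from the hardware inventory.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight memory in GB: parameters (in billions) times bits per weight, over 8."""
    return params_billion * bits_per_weight / 8


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_length: int, batch_size: int = 1,
                bytes_per_element: int = 2) -> float:
    """KV cache in GB; the leading 2 accounts for keys and values."""
    return (2 * n_layers * n_kv_heads * head_dim * context_length
            * batch_size * bytes_per_element) / 1e9


def vram_headroom_gb(total_vram_gb: float, os_reserved_gb: float,
                     model_gb: float) -> float:
    """Positive: the model fits. Negative: that many GB must be offloaded."""
    return (total_vram_gb - os_reserved_gb) - model_gb


# Assumed Llama 3 8B shape: 32 layers, 8 KV heads, head_dim 128, FP16 cache.
weights = weights_gb(8, 4.5)
cache = kv_cache_gb(32, 8, 128, 4096)
total = weights + cache
headroom = vram_headroom_gb(8.0, 1.0, total)  # 1 GB reservation is a placeholder
print(f"weights={weights:.2f} GB, kv={cache:.2f} GB, "
      f"total={total:.2f} GB, headroom={headroom:.2f} GB")
```

A negative headroom is the signal to document the number of offloaded layers and the expected throughput penalty.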
| Level | Bits/weight | Quality loss | Use case |
|---|---|---|---|
| Q2_K | ~2.6 | Severe | Prototyping only |
| Q3_K_M | ~3.4 | High | Acceptable only for low-stakes tasks |
| Q4_K_M | ~4.5 | Minimal | Default for most production use cases |
| Q5_K_M | ~5.7 | Very low | Use when output quality matters more than memory |
| Q6_K | ~6.6 | Negligible | Near-lossless with significant memory saving vs Q8 |
| Q8_0 | ~8.5 | None | Use only when VRAM is not a constraint |
Flag any use of Q2_K or Q3_K in a production system processing user-facing or consequential outputs without documented quality trade-off acceptance.
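One way to apply this flag during a review is a small check over model file names or tags. This is a sketch under assumptions: the function name and the boolean sign-off input are hypothetical, and the pattern only recognizes the Q2_K and Q3_K family spelled as in the table, so other low-bit schemes need their own entries.

```python
import re

# Matches Q2_K / Q3_K and their _S/_M/_L variants as a standalone token,
# e.g. "model.Q2_K.gguf" or "llama3:70b-instruct-q3_K_M".
LOW_BIT_PATTERN = re.compile(
    r"(?<![A-Za-z0-9])Q[23]_K(?:_[SML])?(?![A-Za-z0-9])", re.IGNORECASE
)


def needs_quality_signoff(model_ref: str, has_documented_acceptance: bool) -> bool:
    """True when a Q2_K/Q3_K model lacks a documented quality trade-off acceptance."""
    return bool(LOW_BIT_PATTERN.search(model_ref)) and not has_documented_acceptance


if __name__ == "__main__":
    for ref in ("llama-3-70b.Q2_K.gguf", "mistral-7b.Q4_K_M.gguf"):
        print(ref, "->", "FLAG" if needs_quality_signoff(ref, False) else "ok")
```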
-c / num_ctx) meets three conditions:
-ngl, or memory-mapped I/O overhead).
tcpdump or a network monitoring tool during a generation request; confirm no outbound connections are made during inference
ollama pull <model> is not run automatically in a background job without human review
VRAM overflow to RAM without understanding the penalty. The model is configured with -ngl 99 but does not fit in VRAM. The runtime silently offloads layers to RAM and uses PCIe bandwidth for every forward pass. On PCIe 4.0 x16 (64 GB/s bidirectional), each layer crossing the PCIe bus adds latency. A 7B model with 5 layers offloaded drops from 80 tok/s to 15 tok/s. Reduce -ngl to the number that fits entirely in VRAM; it is faster to run fewer layers on GPU than to have constant PCIe traffic.
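A rough way to pick an -ngl value that stays inside VRAM is to divide the remaining budget by an average per-layer weight size. The sketch below assumes all layers are the same size and that the whole KV cache plus a fixed overhead stays on the GPU, which makes it conservative; the function name and the 0.5 GB overhead default are illustrative, not values taken from any runtime.

```python
import math


def max_gpu_layers(available_vram_gb: float, weights_gb: float, n_layers: int,
                   kv_cache_gb: float, overhead_gb: float = 0.5) -> int:
    """Largest layer count whose weights fit next to the KV cache and overhead."""
    per_layer_gb = weights_gb / n_layers          # assumes uniform layer size
    budget_gb = available_vram_gb - kv_cache_gb - overhead_gb
    if budget_gb <= 0:
        return 0
    return min(n_layers, math.floor(budget_gb / per_layer_gb))


if __name__ == "__main__":
    # 4.5 GB of weights over 32 layers, 0.5 GB KV cache.
    print(max_gpu_layers(6.0, 4.5, 32, 0.5))  # everything fits
    print(max_gpu_layers(4.0, 4.5, 32, 0.5))  # partial offload
```

The result is a starting point for -ngl; the recorded benchmark, not the estimate, decides the final value.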
Wrong quantization for the task type. A multi-step reasoning task (code generation, structured analysis) is running on Q2_K to fit a 70B model in 24 GB VRAM. The output fails consistency checks that Q4_K_M of a smaller 13B model would pass. Q2_K of a large model does not outperform Q4_K_M of a smaller model on reasoning tasks — the quantization loss outweighs the parameter count benefit below Q4. Use Q4_K_M as the minimum for any reasoning-heavy task.
Context length set to maximum for all requests. The runtime is configured with num_ctx=128000 for every request because "more context is always better." At 128K context, the FP16 KV cache for a Llama 3 8B model consumes approximately 16 GB, which exceeds the total VRAM of most consumer GPUs. Most or all of the model then runs on the CPU. Set context length to the 95th percentile of actual required context for the use case, not the model's theoretical maximum.
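The 95th-percentile target can be computed from logged per-request token counts. This sketch uses the nearest-rank percentile definition with integer arithmetic; the function name is hypothetical, and the input is assumed to be prompt plus generated tokens per request, taken from real traffic.

```python
from typing import Iterable


def p95_context_tokens(token_counts: Iterable[int]) -> int:
    """Nearest-rank 95th percentile of observed per-request context sizes."""
    ordered = sorted(token_counts)
    if not ordered:
        raise ValueError("no observations")
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


if __name__ == "__main__":
    observed = [900, 1200, 1500, 2100, 2400, 3000, 3100, 3500, 3900, 60000]
    print(p95_context_tokens(observed))
```

With small samples, a single outlier request can still set the percentile, so the sample size belongs in the report next to the chosen num_ctx.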
Model routing by capability without privacy classification check. The router sends simple queries to the local model and complex queries to the cloud model. The complexity check runs first. A complex query containing patient health records is classified as "complex" and routed to the cloud endpoint before any privacy classification is evaluated. Privacy classification must be the first routing gate, not a secondary filter.
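The gate ordering described above can be sketched as follows. The types, field names, and the idea that sensitivity and complexity arrive as precomputed booleans are assumptions for illustration; a real router would call its own classifiers at these points.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    LOCAL = "local"
    CLOUD = "cloud"


@dataclass(frozen=True)
class Query:
    text: str
    is_sensitive: bool  # output of a privacy classifier (hypothetical)
    is_complex: bool    # output of a complexity check (hypothetical)


def route(query: Query) -> Route:
    """Privacy is evaluated first; complexity is only consulted afterwards."""
    # Gate 1: sensitive content never leaves the machine, however complex.
    if query.is_sensitive:
        return Route.LOCAL
    # Gate 2: capability-based routing for non-sensitive content.
    return Route.CLOUD if query.is_complex else Route.LOCAL


if __name__ == "__main__":
    q = Query("summarize this patient record", is_sensitive=True, is_complex=True)
    print(route(q).value)
```

Because the privacy check returns before the complexity check runs, a misclassified complexity score cannot send sensitive content to the cloud endpoint.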
No throughput baseline before optimization. The team adjusts num_parallel, mlock, and layer allocation across three sessions without any recorded baseline. They believe performance improved because "it feels faster." Without a recorded TTFT and tokens/second measurement from the original configuration, there is no objective way to verify improvement. Always record the benchmark before the first change.
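A baseline needs only two numbers per run: time to first token (TTFT) and tokens per second. The sketch below measures both over any iterable that yields tokens as they arrive; the function name and result keys are hypothetical, and wiring it to a specific runtime's streaming client is left out on purpose.

```python
import time
from typing import Callable, Iterable


def measure_stream(tokens: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Measure TTFT and generation rate for a lazily evaluated token stream."""
    start = clock()  # taken before the first token is requested
    first = last = None
    count = 0
    for _ in tokens:
        last = clock()
        if first is None:
            first = last
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    gen_time = last - first
    # Rate is measured over the tokens that followed the first one.
    rate = (count - 1) / gen_time if count > 1 and gen_time > 0 else 0.0
    return {"ttft_s": first - start, "tokens": count, "tokens_per_s": rate}


if __name__ == "__main__":
    def fake_stream():
        for word in ("local", "models", "are", "fun"):
            time.sleep(0.01)
            yield word

    print(measure_stream(fake_stream()))
```

Running the same prompt set through this before and after each configuration change gives the recorded comparison the paragraph above asks for.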
Silent model updates via automated pull. A cron job runs ollama pull llama3 nightly. The model tag llama3 is a floating pointer. One night, the upstream model is updated with a new version. The behavioral change is not noticed for two weeks. For production systems, pin to a specific digest: ollama run llama3@sha256:abc123... and treat updates as deployments requiring review.
Telemetry sending prompt content. LM Studio is configured with default telemetry settings. The telemetry includes "usage analytics" that contain prompt metadata. For a system processing medical records or legal documents, this violates data handling requirements. Audit every telemetry field before enabling the runtime on any sensitive data workload. Disable all telemetry that cannot be confirmed as non-content.
Produce a local LLM runtime review report with:
npx claudepluginhub fluxonlab/skillry --plugin skillry-optional-specialist