From llm-externalizer
Select GGUF artifacts and quantizations for llama.cpp on CPU, Mac Metal, CUDA, or ROCm runtimes. Covers Q4_K_M vs Q5_K_M vs Q6_K trade-offs, llama-server launch flags, --hf-repo/--hf-file fallback for non-standard naming, and conversion from Transformers weights when no GGUF exists. Use when the user picks llama.cpp / LM Studio / Ollama on non-Apple-Silicon platforms. Loaded by llm-externalizer-setup-agent.
Slash command
/llm-externalizer:huggingface-local-models
Search the Hugging Face Hub for llama.cpp-compatible GGUF repos, choose the right quant for the target hardware, and launch the model with `llama-cli` or `llama-server`. For MLX on Apple Silicon, see `huggingface-mlx-models` instead.
The setup wizard's scripts/setup/recommend-models.py emits, for every recommended model, the list of GGUF artifacts on HF (whatcani.run runtime=llama.cpp filter) with pre-built download_command lines. The wizard runs that command verbatim. This skill is consulted for:
- llama-server launch flags (context size, threading, KV-cache offload, flash-attention)
- --hf-repo / --hf-file fallback for non-standard naming
- apps=llama.cpp HF Hub filters for obscure repos
The wizard does NOT call this skill on Apple Silicon arm64 when the user picked MLX as the runtime; huggingface-mlx-models handles that path.
External requirements:
- llama.cpp installed (brew install llama.cpp, winget install llama.cpp, or build from source)
- hf CLI authenticated for gated repos (hf auth login)
1. Search the Hub with apps=llama.cpp.
2. Open https://huggingface.co/<repo>?local-app=llama.cpp for the recommended quant.
3. Confirm the exact .gguf filenames with https://huggingface.co/api/models/<repo>/tree/main?recursive=true.
4. Run llama-cli -hf <repo>:<QUANT> or llama-server -hf <repo>:<QUANT>.
5. Fall back to --hf-repo plus --hf-file when the repo uses custom file naming.
Return the recommended GGUF artifact + launch command + verified smoke-test result to the user.
See launch-recipes.md §Failure modes: custom file naming, no GGUF artifact, gated repo, smoke-test fails.
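The filename-confirmation step against the tree API can be sketched in shell. This is a minimal sketch, not the skill's own implementation: it assumes the endpoint returns a JSON array whose entries carry a "path" field and that filenames contain no escaped quotes. The helper name list_gguf_files is hypothetical, and grep/sed stand in for a real JSON parser only to avoid extra dependencies.

```shell
#!/usr/bin/env bash
# Sketch: print every .gguf path found in a Hugging Face tree API response.
# Reads the JSON response on stdin. list_gguf_files is a hypothetical helper,
# and the "path" field layout is an assumption about the response shape.
list_gguf_files() {
  # Pull out each "path":"....gguf" pair, then strip the key and the quotes.
  grep -oE '"path"[[:space:]]*:[[:space:]]*"[^"]*\.gguf"' \
    | sed -E 's/^"path"[[:space:]]*:[[:space:]]*"//; s/"$//'
}

# Example (needs network access; substitute a real repo id for <repo>):
#   curl -fsSL "https://huggingface.co/api/models/<repo>/tree/main?recursive=true" | list_gguf_files
```

If the printed names do not follow the usual `<QUANT>` pattern, that is a hint the repo needs the explicit --hf-repo / --hf-file form rather than the -hf shorthand.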
# Install + auth + serve
brew install llama.cpp
hf auth login
llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M
# Run an exact GGUF file with explicit context size
llama-server --hf-repo unsloth/Qwen3.6-35B-A3B-GGUF --hf-file Qwen3.6-35B-A3B-UD-Q4_K_M.gguf -c 4096
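The choice between the two launch forms above can be sketched as a small command builder. This is only an illustration under stated assumptions: build_launch_cmd is a hypothetical helper (not part of llama.cpp or the wizard), it prints the command instead of running it, and it uses only the flags already shown in this section (-hf, --hf-repo, --hf-file, -c).

```shell
#!/usr/bin/env bash
# Sketch: print (not run) a llama-server command line.
# build_launch_cmd is a hypothetical helper, not part of llama.cpp.
#   build_launch_cmd <repo> <quant>              -> -hf <repo>:<quant> shorthand
#   build_launch_cmd <repo> "" <file> [ctx]      -> explicit --hf-repo / --hf-file
build_launch_cmd() {
  local repo="$1" quant="$2" file="${3:-}" ctx="${4:-}"
  local cmd
  if [ -n "$file" ]; then
    # Custom file naming: name the exact GGUF file.
    cmd="llama-server --hf-repo ${repo} --hf-file ${file}"
  else
    # Standard naming: let llama.cpp resolve the quant tag.
    cmd="llama-server -hf ${repo}:${quant}"
  fi
  if [ -n "$ctx" ]; then
    # Optional explicit context size.
    cmd="${cmd} -c ${ctx}"
  fi
  printf '%s\n' "$cmd"
}
```

For example, `build_launch_cmd unsloth/Qwen3.6-35B-A3B-GGUF UD-Q4_K_M` prints the shorthand form shown above. Once a server is running, `curl -f http://127.0.0.1:8080/health` is one way to smoke-test it, assuming llama-server's default port of 8080 has not been changed.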
Install llama.cpp · Authenticate for gated repos · Search the Hub · Run directly from the Hub · Run an exact GGUF file · Convert only when no GGUF is available · Smoke test a local server · Quant Choice · Failure modes
Core URLs · Search for llama.cpp-compatible models · Use the local-app page for the recommended quant · Confirm exact files from the tree API · Build the command · Example: unsloth/Qwen3.6-35B-A3B-GGUF · Notes
Hub-first quant selection · Quantization Formats · Converting Models · K-Quantization Methods · Quality Testing · Use Case Guide · Model Size Scaling · Finding Pre-Quantized Models · Importance Matrices (imatrix) · Troubleshooting
Apple Silicon (Metal) · NVIDIA (CUDA) · AMD (ROCm) · CPU
https://github.com/ggml-org/llama.cpp
https://huggingface.co/docs/hub/gguf-llamacpp
https://huggingface.co/docs/hub/main/local-apps
https://huggingface.co/spaces/ggml-org/gguf-my-repo
npx claudepluginhub emasoft/emasoft-plugins --plugin llm-externalizer