From huggingface-skills
Searches the Hugging Face Hub for llama.cpp GGUF models, selects quants, and runs them locally via llama-cli or llama-server with an OpenAI-compatible API.
Slash command: /huggingface-skills:huggingface-local-models
Search the Hugging Face Hub for llama.cpp-compatible GGUF repos, choose the right quant, and launch the model with `llama-cli` or `llama-server`.
- Search the Hub with `apps=llama.cpp`.
- Open `https://huggingface.co/<repo>?local-app=llama.cpp`.
- Confirm the exact `.gguf` filenames with `https://huggingface.co/api/models/<repo>/tree/main?recursive=true`.
- Launch with `llama-cli -hf <repo>:<QUANT>` or `llama-server -hf <repo>:<QUANT>`.
- Fall back to `--hf-repo` plus `--hf-file` when the repo uses custom file naming.

brew install llama.cpp
winget install llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
hf auth login
https://huggingface.co/models?apps=llama.cpp&sort=trending
https://huggingface.co/models?search=Qwen3.6&apps=llama.cpp&sort=trending
https://huggingface.co/models?search=<term>&apps=llama.cpp&num_parameters=min:0,max:24B&sort=trending
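The search URLs above differ only in their query parameters, so they can be generated rather than typed by hand. The following is a minimal sketch in Python; the function name `hub_search_url` and its arguments are hypothetical, and it only reproduces the parameters already shown above (`search`, `apps`, `num_parameters`, `sort`).

```python
from urllib.parse import urlencode

HUB_MODELS_URL = "https://huggingface.co/models"


def hub_search_url(term=None, max_params=None, sort="trending"):
    """Build a Hub model-search URL restricted to llama.cpp-compatible repos.

    Hypothetical helper: it mirrors the hand-written URLs above and nothing more.
    """
    query = {}
    if term:
        query["search"] = term
    query["apps"] = "llama.cpp"
    if max_params:
        # Same shape as the filter in the URL above, e.g. "min:0,max:24B".
        query["num_parameters"] = f"min:0,max:{max_params}"
    query["sort"] = sort
    # Keep ":" and "," literal so the result matches the hand-written form.
    return f"{HUB_MODELS_URL}?{urlencode(query, safe=':,')}"


print(hub_search_url())
print(hub_search_url("Qwen3.6"))
print(hub_search_url("Qwen3.6", max_params="24B"))
```

The three calls print the same three URL shapes listed above, with `Qwen3.6` standing in for `<term>` in the last one.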
llama-cli -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M
llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M
llama-server \
  --hf-repo unsloth/Qwen3.6-35B-A3B-GGUF \
  --hf-file Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
  -c 4096
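The `--hf-file` form needs the exact filename, which the tree endpoint `https://huggingface.co/api/models/<repo>/tree/main?recursive=true` can supply. Below is a rough Python sketch of that lookup. The helper names are hypothetical, the code assumes each entry in the response is an object with `type` and `path` fields, and the sample listing is made up for illustration. Files named `mmproj-*.gguf` are skipped because they are projector weights rather than the main checkpoint.

```python
import json
from urllib.request import urlopen


def tree_url(repo):
    """URL of the recursive file listing for a Hub model repo."""
    return f"https://huggingface.co/api/models/{repo}/tree/main?recursive=true"


def gguf_files(entries):
    """Return main-model .gguf paths from a parsed tree listing.

    Assumption: each entry is a dict with "type" and "path" keys.
    """
    paths = []
    for entry in entries:
        path = entry.get("path", "")
        name = path.rsplit("/", 1)[-1]
        if entry.get("type") != "file" or not name.endswith(".gguf"):
            continue
        if name.startswith("mmproj-"):
            # Projector weights, not a candidate for --hf-file.
            continue
        paths.append(path)
    return sorted(paths)


def fetch_gguf_files(repo):
    """Network call: list a public repo's .gguf files (no auth header sent)."""
    with urlopen(tree_url(repo)) as response:
        return gguf_files(json.load(response))


# Made-up listing, for illustration only.
sample = [
    {"type": "file", "path": "README.md"},
    {"type": "directory", "path": "BF16"},
    {"type": "file", "path": "BF16/example-BF16.gguf"},
    {"type": "file", "path": "example-UD-Q4_K_M.gguf"},
    {"type": "file", "path": "mmproj-F16.gguf"},
]
print(gguf_files(sample))
```

Calling `fetch_gguf_files("<repo>")` performs the real request; gated or private repos would additionally need an authorization header, which this sketch does not send.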
hf download <repo-without-gguf> --local-dir ./model-src
python convert_hf_to_gguf.py ./model-src \
  --outfile model-f16.gguf \
  --outtype f16
llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write a limerick about exception handling"}
    ]
  }'
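The same request can be issued from Python using only the standard library. This is a minimal sketch, not an official client: the function names `build_chat_request` and `chat` are hypothetical, the URL, headers, and body copy the curl call above, and the reply parsing assumes the OpenAI chat-completions response shape (`choices[0].message.content`).

```python
import json
from urllib.request import Request, urlopen


def build_chat_request(prompt, base_url="http://localhost:8080"):
    """Build the POST request that the curl command above sends."""
    body = {"messages": [{"role": "user", "content": prompt}]}
    return Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer no-key",
        },
        method="POST",
    )


def chat(prompt, base_url="http://localhost:8080"):
    """Send the prompt to a running llama-server and return the reply text."""
    with urlopen(build_chat_request(prompt, base_url)) as response:
        payload = json.load(response)
    # Assumption: OpenAI-style response with the text under choices[0].message.
    return payload["choices"][0]["message"]["content"]


# Building the request does not touch the network, so it is safe to inspect.
request = build_chat_request("Write a limerick about exception handling")
print(request.full_url)
print(request.data.decode("utf-8"))
```

With `llama-server` already listening on port 8080, `chat("Write a limerick about exception handling")` sends the request and returns the generated text.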
- Start from the quant shown on the `?local-app=llama.cpp` page.
- Keep repo-native labels such as `UD-Q4_K_M` instead of normalizing them.
- Default to `Q4_K_M` unless the repo page or hardware profile suggests otherwise.
- Prefer `Q5_K_M` or `Q6_K` for code or technical workloads when memory allows.
- Use `Q3_K_M`, `Q4_K_S`, or repo-specific `IQ` / `UD-*` variants for tighter RAM or VRAM budgets.
- Treat `mmproj-*.gguf` files as projector weights, not the main checkpoint.
- ... `imatrix`.

- https://github.com/ggml-org/llama.cpp
- https://huggingface.co/docs/hub/gguf-llamacpp
- https://huggingface.co/docs/hub/main/local-apps
- https://huggingface.co/docs/hub/agents-local
- https://huggingface.co/spaces/ggml-org/gguf-my-repo