From llm-externalizer
Find the highest-scoring models for a coding task by querying the official Hugging Face benchmark leaderboards, with memory-budget filtering and per-device fit. Use when the recommender script returns no compatible row and the user has explicitly widened the search. Loaded by llm-externalizer-setup-agent.
Slash command: /llm-externalizer:huggingface-best
Summary shown in Claude's skill listing:
Finds best models for a task by querying official HF benchmark leaderboards, enriching with model size data, filtering for device fit, and returning a comparison table with benchmark scores. Secondary widening path beyond `scripts/setup/recommend-models.py`.
The setup wizard's primary model-selection source is scripts/setup/recommend-models.py. That script:
- final_score, headroom_gb, and pre-built download_command lines

The wizard MUST try the recommender first. This skill is consulted ONLY when both:

- compatible: true row at the user's RAM tier

When invoked, this skill reads HF leaderboards directly; it is slower than the recommender and has no RAM-aware filtering. After returning a candidate, the wizard MUST re-apply the memory-budget check from recommend-models.py before recommending.
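The re-applied memory-budget check might look roughly like the sketch below. It is only an illustration: the bytes-per-parameter figures, the `usable_fraction` default, and the names `Candidate` and `fits_budget` are assumptions made here, not the actual logic or API of recommend-models.py.

```python
from dataclasses import dataclass

# Assumed rule-of-thumb storage cost per parameter for each precision.
# These are illustrative values, not taken from recommend-models.py.
BYTES_PER_PARAM = {"fp16": 2.0, "q8": 1.0, "q4": 0.5}


@dataclass
class Candidate:
    repo_id: str
    params_billions: float


def estimated_weights_gb(params_billions: float, precision: str) -> float:
    """Approximate weight size in GB (billions of params x bytes per param)."""
    return params_billions * BYTES_PER_PARAM[precision]


def fits_budget(candidate: Candidate, precision: str, ram_gb: float,
                usable_fraction: float = 0.9):
    """Return (fits, headroom_gb) for a candidate at a given precision.

    usable_fraction is an assumed share of RAM available to the model after
    the OS and other processes; the real script may apply a different rule.
    """
    budget_gb = ram_gb * usable_fraction
    needed_gb = estimated_weights_gb(candidate.params_billions, precision)
    headroom_gb = round(budget_gb - needed_gb, 2)
    return headroom_gb >= 0, headroom_gb
```

A wizard could call `fits_budget(candidate, "fp16", ram_gb)` first and retry with `"q4"` when the first call reports no headroom, then discard candidates that fit at neither precision.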
External requirements:
- hf CLI authenticated (hf auth login)
- curl and jq on PATH

Follow the six steps documented in leaderboard-workflow.md:
- /api/datasets?filter=benchmark:official.
- safetensors.total, license).

Return the comparison table to the user with the recommended pick starred and per-device fit annotated. See leaderboard-workflow.md §Step 6 for the exact format.
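The metadata enrichment mentioned in the steps (safetensors.total, license) can be sketched as a small parser over an already-fetched model-info payload. This is a hedged illustration: the helper names are hypothetical, and the field layout (`safetensors.total`, `cardData.license`, `license:<id>` tags) is assumed from typical responses of https://huggingface.co/api/models/<repo_id> and should be checked against a live response.

```python
from urllib.parse import quote

HF_API = "https://huggingface.co/api"


def model_info_url(repo_id: str) -> str:
    """Build the model-info endpoint URL for a repo id such as 'org/name'."""
    return f"{HF_API}/models/{quote(repo_id, safe='/')}"


def extract_size_and_license(info: dict) -> dict:
    """Pull the parameter count and license out of a model-info payload.

    Assumes the payload carries 'safetensors.total' and a license either in
    'cardData.license' or as a 'license:<id>' tag; missing fields become None.
    """
    total = (info.get("safetensors") or {}).get("total")
    license_id = (info.get("cardData") or {}).get("license")
    if license_id is None:
        # Fall back to the tag list when the card metadata has no license.
        for tag in info.get("tags") or []:
            if tag.startswith("license:"):
                license_id = tag.split(":", 1)[1]
                break
    return {
        "repo_id": info.get("id"),
        "params_billions": round(total / 1e9, 2) if total else None,
        "license": license_id,
    }
```

Returning `None` for absent fields, rather than raising, lets the later filter step decide whether a model with unknown size should be dropped or flagged.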
- hub_repo_search with filters=["<task>"] sorted by trendingScore
- hub_repo_search for popular task-tagged models, flag results as popularity-ranked

Input: "best coding model for my M2 16 GB MacBook"
curl -s "https://huggingface.co/api/datasets?filter=benchmark:official&limit=500" | jq '...'
curl -s "https://huggingface.co/api/datasets/openai/humaneval/leaderboard" | jq '.[:15]'
hf models info qwen/Qwen2.5-Coder-7B --json
Output:
| # | Model | Params | HumanEval | MBPP | License | On device |
|---|-------|--------|-----------|------|---------|-----------|
| ⭐1 | [qwen/Qwen2.5-Coder-7B](https://huggingface.co/qwen/Qwen2.5-Coder-7B) | 7B | 85.2% | 79.1% | Apache 2.0 | Yes (fp16) |
| 2 | [deepseek-ai/deepseek-coder-13b](https://huggingface.co/deepseek-ai/deepseek-coder-13b) | 13B | 83.1% | 71.5% | MIT | Q4 only |
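A table in the format above could be produced by a small renderer like the one sketched here. The function name `render_table` and the row keys are hypothetical, and the exact format required by leaderboard-workflow.md §Step 6 may differ; the sketch assumes rows arrive already sorted best-first with the fit label computed elsewhere.

```python
def render_table(rows):
    """Render candidates as a Markdown table, starring the first row.

    Each row is a dict with keys: repo_id, params, humaneval, mbpp, license,
    on_device. Rows are assumed to be pre-sorted best-first.
    """
    header = "| # | Model | Params | HumanEval | MBPP | License | On device |"
    divider = "|---|-------|--------|-----------|------|---------|-----------|"
    lines = [header, divider]
    for rank, row in enumerate(rows, start=1):
        # Only the top-ranked (recommended) pick gets the star marker.
        marker = f"⭐{rank}" if rank == 1 else str(rank)
        link = f"[{row['repo_id']}](https://huggingface.co/{row['repo_id']})"
        lines.append(
            f"| {marker} | {link} | {row['params']} | {row['humaneval']:.1f}% "
            f"| {row['mbpp']:.1f}% | {row['license']} | {row['on_device']} |"
        )
    return "\n".join(lines)
```

Keeping the renderer free of ranking and fit logic means the same function can print recommender output and leaderboard output without changes.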
Step 1: Parse the request · Step 2: Find relevant benchmark datasets · Step 3: Fetch top models from leaderboards · Step 4: Enrich with model metadata · Step 5: Filter and rank · Step 6: Output · Examples
- https://huggingface.co/api/datasets?filter=benchmark:official
- https://huggingface.co/api/models/<repo_id>
- https://huggingface.co/docs/huggingface_hub/en/guides/jobs
- scripts/setup/recommend-models.py (primary path)
npx claudepluginhub emasoft/emasoft-plugins --plugin llm-externalizer