From llm-externalizer
Run rigorous evaluations against Hugging Face Hub models using inspect-ai or lighteval on local hardware. Covers backend selection (vLLM / Transformers / accelerate), local GPU evals, smoke tests, and task selection. Use when the user wants a deeper benchmark than the wizard's 5-test compatibility check. Loaded by llm-externalizer-setup-agent.
How this skill is triggered — by the user, by Claude, or both
Slash command
/llm-externalizer:huggingface-community-evals
The summary Claude sees in its skill listing, used to decide when to auto-load this skill
Run rigorous evaluations against Hugging Face Hub models on local hardware via `inspect-ai` or `lighteval`. Use AFTER the model has passed the wizard's compatibility check, when the user explicitly asks for a deeper benchmark (MMLU, HumanEval, etc.). NOT for HF Jobs orchestration, model-card PRs, `.eval_results` publication, or community-evals automation.
The setup wizard's scripts/setup/test-model.py runs 5 fast calibrated tests:
- smoke: basic completion
- structured_output: JSON-schema response_format (the plugin's hard compatibility requirement)
- code_understanding: find an off-by-one bug in 4 lines of Python, return strict JSON
- long_context: ~30K-token input, return a one-sentence summary
- output_length: emit ≥4K tokens before stopping

This is sufficient for confirming a model can run llm-externalizer scans, code reviews, and reports.
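The structured_output and code_understanding tests both require the model to return strict JSON. The sketch below shows one way such a reply could be checked. It is not the wizard's actual code: the field names `bug_line` and `explanation` and the helper `parse_strict_json` are hypothetical, chosen only for illustration.

```python
import json

# Hypothetical reply shape: the real schema used by test-model.py is not shown here.
EXPECTED_FIELDS = {"bug_line": int, "explanation": str}


def parse_strict_json(reply: str) -> dict:
    """Accept only a bare JSON object with exactly the expected keys and types."""
    # json.loads raises json.JSONDecodeError (a ValueError subclass) when the
    # reply contains prose, markdown fences, or trailing text around the JSON.
    data = json.loads(reply)
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    if set(data) != set(EXPECTED_FIELDS):
        raise ValueError(f"expected keys {sorted(EXPECTED_FIELDS)}, got {sorted(data)}")
    for key, expected_type in EXPECTED_FIELDS.items():
        value = data[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            raise ValueError(f"{key!r} must be of type {expected_type.__name__}")
    return data
```

A reply such as `Sure! {"bug_line": 3, ...}` fails at the `json.loads` step, which is the behavior a strict-JSON requirement implies.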
Consult this skill ONLY when:
The skill's output is INFORMATIONAL. The wizard's pass/fail verdict comes from test-model.py, not this skill.
External: uv run, HF_TOKEN for gated models, GPU for local runs (nvidia-smi).
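Before starting a run, it can help to confirm that these prerequisites are present. The following is a minimal sketch, not part of the skill's scripts; the function name `check_prerequisites` and its injectable `env` and `which` parameters are assumptions made for illustration and testability.

```python
import os
import shutil


def check_prerequisites(env=None, which=shutil.which):
    """Report which external requirements appear to be available.

    Hypothetical helper: it only checks for presence, not validity
    (for example, it does not verify that HF_TOKEN grants access).
    """
    env = os.environ if env is None else env
    return {
        "uv": which("uv") is not None,                  # needed for `uv run`
        "hf_token": bool(env.get("HF_TOKEN")),          # needed for gated models
        "nvidia_smi": which("nvidia-smi") is not None,  # hints at a usable GPU
    }


if __name__ == "__main__":
    for name, present in check_prerequisites().items():
        print(f"{name}: {'found' if present else 'missing'}")
```

A missing `nvidia_smi` entry does not rule out every run, since only local GPU runs need it.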
Follow the workflow in evaluation-recipes.md:
--limit 10 or --max-samples 10).
hf jobs CLI (see the hf-cli skill and https://huggingface.co/docs/huggingface_hub/en/guides/jobs).
Return the evaluation results table (scored metric per task per model), with smoke-test pass/fail clearly indicated, and the next-step recommendation (scale up locally / hand off to HF Jobs / try a different backend).
See evaluation-recipes.md §Error Handling for: CUDA/vLLM OOM, model unsupported by vLLM (switch backend), gated repo (verify HF_TOKEN), custom model code (--trust-remote-code).
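The four failure classes above can be turned into a simple lookup from error text to the suggested next step. This is a sketch under stated assumptions: the substrings matched here are guesses at typical error wording, not strings taken from evaluation-recipes.md, and `suggest_fix` is a hypothetical helper.

```python
# (substring, hint) pairs. The substrings are ASSUMED typical error wording;
# adjust them to whatever the tools actually print in your environment.
HINTS = [
    ("out of memory", "CUDA/vLLM OOM: see evaluation-recipes.md §Error Handling"),
    ("not supported", "model unsupported by vLLM: switch backend"),
    ("gated repo", "gated repo: verify HF_TOKEN"),
    ("trust_remote_code", "custom model code: rerun with --trust-remote-code"),
]


def suggest_fix(error_text: str):
    """Return the first matching hint, or None when nothing matches."""
    lowered = error_text.lower()
    for needle, hint in HINTS:
        if needle in lowered:
            return hint
    return None
```

Returning `None` for unrecognized errors keeps the helper from suggesting a remedy that the source does not describe.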
# Smoke test on a tiny model with MMLU via Inference Providers
uv run scripts/inspect_eval_uv.py --model meta-llama/Llama-3.2-1B --task mmlu --limit 20
# Local GPU vLLM run on gsm8k
uv run scripts/inspect_vllm_uv.py --model meta-llama/Llama-3.2-1B --task gsm8k --limit 20
# lighteval with accelerate fallback
uv run scripts/lighteval_vllm_uv.py --model microsoft/phi-2 --tasks "leaderboard|mmlu|5" --backend accelerate --max-samples 20
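When these commands are launched from another program, building them as argument lists avoids shell-quoting problems, notably the `|` characters in the lighteval task string. The sketch below uses only the script paths and flags shown in the commands above; the function names and default values are assumptions for illustration.

```python
import shlex


def inspect_command(script, model, task, limit=20):
    """Build argv for an inspect-ai run capped at `limit` samples."""
    return ["uv", "run", script, "--model", model, "--task", task,
            "--limit", str(limit)]


def lighteval_command(model, tasks, backend="accelerate", max_samples=20):
    """Build argv for a lighteval run capped at `max_samples` samples."""
    return ["uv", "run", "scripts/lighteval_vllm_uv.py", "--model", model,
            "--tasks", tasks, "--backend", backend,
            "--max-samples", str(max_samples)]


if __name__ == "__main__":
    argv = lighteval_command("microsoft/phi-2", "leaderboard|mmlu|5")
    # shlex.join quotes the task string so the pipes are not read by the shell.
    print(shlex.join(argv))
    # To execute: subprocess.run(argv, check=True)
```

Passing the list directly to `subprocess.run` skips the shell entirely, so no quoting is needed at execution time.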
Detailed scope · When To Use Which Script · Tools required · Core Workflow · Quick Start · Remote Execution Boundary · Task Selection · Backend Selection · Hardware Guidance · Error Handling · Examples
What this skill covers · What this skill does NOT cover · Setup · inspect-ai examples · lighteval examples · Hand-off to Hugging Face Jobs
- scripts/inspect_eval_uv.py: inspect-ai eval via Inference Providers
- scripts/inspect_vllm_uv.py: inspect-ai with local vLLM / HF backend
- scripts/lighteval_vllm_uv.py: lighteval with vLLM / accelerate backend
- https://github.com/UKGovernmentBEIS/inspect_ai
- https://github.com/huggingface/lighteval
- https://huggingface.co/docs/inference-providers

npx claudepluginhub emasoft/emasoft-plugins --plugin llm-externalizer