eval-hub-skills

Agent Skills for EvalHub, following the agentskills.io open format.
These skills enable AI coding agents (Claude Code, Copilot, Codex, etc.) to discover and execute EvalHub model evaluations during development sessions.
Skills
| Skill | Description |
|---|
evalhub | Full skill — discovery, evaluation, job lifecycle, and EDD workflows |
evalhub-discovery | Discover providers, benchmarks, and collections; read agent metadata |
evalhub-eval | Submit evaluation jobs against benchmarks or collections |
evalhub-jobs | Monitor, wait on, cancel, and fetch logs for evaluation jobs |
Installation
Prerequisites
- Python 3.11+
- uv (scripts use PEP 723 inline metadata for auto-dependency resolution)
- Network access to an EvalHub service
EVALHUB_BASE_URL, EVALHUB_TOKEN, EVALHUB_TENANT environment variables
Install via Claude Code plugin (recommended)
/plugin marketplace add eval-hub/eval-hub-skills
/plugin install evalhub@evalhub
The skill is then available as /evalhub:evalhub in any Claude Code session.
Install locally (development)
Clone the repo and symlink the skills into ~/.claude/skills/:
git clone https://github.com/eval-hub/eval-hub-skills
cd eval-hub-skills
make install-all # installs all four skills
To install only the primary skill:
make install
Changes to the skill source are reflected immediately without reinstalling.
Connect to an MCP server on a cluster
If EvalHub exposes an MCP server on your cluster, you can register it directly with Claude Code using the claude mcp add CLI command. EvalHub's MCP requires a bearer token and an x-tenant header (the namespace):
claude mcp add evalhub "$EVALHUB_BASE_URL/mcp" \
--transport http \
--header "Authorization: Bearer $EVALHUB_TOKEN" \
--header "x-tenant: $EVALHUB_TENANT"
This writes the server into your local Claude Code config (.claude/settings.json). Use --scope user to register it globally across all projects instead.
Note: OpenShift tokens expire. If you get 401 errors, refresh with export EVALHUB_TOKEN="$(oc whoami -t)" and re-run the command.
Uninstall
make uninstall-all # remove all skills
make uninstall # remove primary skill only
Update
make update-all
Validate
make check
Configuration
Set these environment variables before using the skill:
export EVALHUB_BASE_URL="https://evalhub.apps.cluster.example.com"
export EVALHUB_TOKEN="$(oc whoami -t)"
export EVALHUB_TENANT="eval-test"
Optionally, for clusters with self-signed certificates:
export EVALHUB_INSECURE=true
Usage
Once installed, the skill is automatically discovered by Claude Code. Ask naturally in any session.
Example: discover what providers are available
You: What evaluation providers are available?
The skill fetches live metadata from the API and summarises each provider:
Available providers:
- AgentDojo — Test agent resilience to prompt injection across realistic tool-use
scenarios. Targets: agent. Evaluates: agent-safety, prompt-injection,
tool-use-security.
- LM Evaluation Harness — Comprehensive evaluation framework with 180 benchmarks
covering math, science, coding, safety, reasoning, and more.
- Garak — LLM vulnerability scanner and red-teaming framework.
- GuideLLM — Performance benchmarking for LLM inference servers.
- SWE-bench — Software engineering benchmark using real-world GitHub issues.
- Lighteval — Lightweight LLM evaluation from Hugging Face.
Example: find providers for a specific use case
You: Which providers can evaluate my agent for prompt injection?
[
{
"id": "agentdojo",
"name": "AgentDojo",
"summary": "Test agent resilience to prompt injection across realistic tool-use scenarios",
"target_type": "agent",
"evaluates": ["agent-safety", "prompt-injection", "tool-use-security"]
}
]
Claude: AgentDojo is the right fit here. It runs your agent through realistic
tool-use scenarios (workspace, Slack, banking, travel) and measures both utility
(task completion) and security (injection resistance). The model must support
tool/function calling via an OpenAI-compatible API.
Example: list available evaluation collections