Name: evalhub
Author: eval-hub

eval-hub-skills

Agent Skills for EvalHub, following the agentskills.io open format.

These skills enable AI coding agents (Claude Code, Copilot, Codex, etc.) to discover and execute EvalHub model evaluations during development sessions.

Skills

Skill	Description
`evalhub`	Full skill — discovery, evaluation, job lifecycle, and EDD workflows
`evalhub-discovery`	Discover providers, benchmarks, and collections; read agent metadata
`evalhub-eval`	Submit evaluation jobs against benchmarks or collections
`evalhub-jobs`	Monitor, wait on, cancel, and fetch logs for evaluation jobs

Installation

Prerequisites

Python 3.11+
uv (scripts use PEP 723 inline metadata for auto-dependency resolution)
Network access to an EvalHub service
EVALHUB_BASE_URL, EVALHUB_TOKEN, EVALHUB_TENANT environment variables
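
A preflight check along these lines can catch a missing variable before a script contacts the service. This is a minimal sketch, not code from the repository: the `missing_env` and `preflight_message` helpers are hypothetical, and the empty `dependencies` list in the PEP 723 header is a placeholder, since the scripts' real dependencies are not listed here.

```python
# /// script
# requires-python = ">=3.11"
# dependencies = []  # placeholder: the real scripts declare their own
# ///
"""Preflight check for EvalHub environment variables (illustrative sketch)."""
import os

# The three variables named in the prerequisites.
REQUIRED_VARS = ("EVALHUB_BASE_URL", "EVALHUB_TOKEN", "EVALHUB_TENANT")


def missing_env(env):
    """Return the required variable names that are unset or empty in env."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


def preflight_message(env):
    """Summarise the check as a single human-readable line."""
    missing = missing_env(env)
    if missing:
        return "Missing environment variables: " + ", ".join(missing)
    return "EvalHub environment looks complete."


# Report on the current shell environment.
print(preflight_message(os.environ))
```

Saved to a file, a script with this header can be started with `uv run`, which reads the inline metadata and resolves the declared dependencies first.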

Install via Claude Code plugin (recommended)

/plugin marketplace add eval-hub/eval-hub-skills
/plugin install evalhub@evalhub

The skill is then available as /evalhub:evalhub in any Claude Code session.

Install locally (development)

Clone the repo and symlink the skills into ~/.claude/skills/:

git clone https://github.com/eval-hub/eval-hub-skills
cd eval-hub-skills
make install-all   # installs all four skills

To install only the primary skill:

make install

Because the skills are symlinked rather than copied, changes to the skill source take effect immediately without reinstalling.
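
The effect of symlinking can be illustrated with a small sketch. It is not what the Makefile runs (its steps are not shown here); `link_skill` is a hypothetical helper, and the temporary directories and the `SKILL.md` file stand in for the cloned repo and `~/.claude/skills/`.

```python
import tempfile
from pathlib import Path


def link_skill(source_dir, skills_root, name):
    """Symlink source_dir to skills_root/name, replacing an existing link."""
    skills_root = Path(skills_root)
    skills_root.mkdir(parents=True, exist_ok=True)
    link = skills_root / name
    if link.is_symlink():
        link.unlink()  # drop a stale link so the new one can be created
    link.symlink_to(Path(source_dir).resolve(), target_is_directory=True)
    return link


def demo():
    """Show that an edit made after linking is visible through the link."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "repo" / "evalhub"      # stand-in for the clone
        source.mkdir(parents=True)
        link = link_skill(source, Path(tmp) / "skills", "evalhub")
        (source / "SKILL.md").write_text("edited after linking")
        return (link / "SKILL.md").read_text()
```

Since the link points at the working tree, nothing is copied, and an edit in the clone is what the agent reads on its next use of the skill.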

Connect to an MCP server on a cluster

If EvalHub exposes an MCP server on your cluster, you can register it directly with Claude Code using the claude mcp add CLI command. EvalHub's MCP requires a bearer token and an x-tenant header (the namespace):

claude mcp add evalhub "$EVALHUB_BASE_URL/mcp" \
  --transport http \
  --header "Authorization: Bearer $EVALHUB_TOKEN" \
  --header "x-tenant: $EVALHUB_TENANT"

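By default this registers the server at local scope, meaning it is private to you and applies only to the current project (Claude Code keeps local-scope servers in ~/.claude.json rather than in the project's .claude/settings.json). Use --scope user to register it across all your projects instead.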

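Note: OpenShift tokens expire, and the shell expands $EVALHUB_TOKEN when you run the command, so the registration stores the token value from that moment. If you get 401 errors, refresh the token with export EVALHUB_TOKEN="$(oc whoami -t)", remove the stale entry with claude mcp remove evalhub, and re-run the claude mcp add command.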
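
The same two headers apply to any client that talks to the MCP endpoint. A minimal sketch of how they are assembled, assuming the endpoint lives at `/mcp` under the base URL as in the command above; `build_mcp_headers` and `mcp_url` are hypothetical helper names, not part of EvalHub:

```python
def build_mcp_headers(token, tenant):
    """Build the bearer-token and tenant headers described above."""
    if not token or not tenant:
        raise ValueError("both a bearer token and a tenant are required")
    return {
        "Authorization": f"Bearer {token}",
        "x-tenant": tenant,  # the tenant is the namespace
    }


def mcp_url(base_url):
    """Append the /mcp path, tolerating a trailing slash on the base URL."""
    return base_url.rstrip("/") + "/mcp"
```

Failing early on an empty token gives a clearer error than a 401 from the server.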

Uninstall

make uninstall-all   # remove all skills
make uninstall       # remove primary skill only

Update

make update-all

Validate

make check

Configuration

Set these environment variables before using the skill:

export EVALHUB_BASE_URL="https://evalhub.apps.cluster.example.com"
export EVALHUB_TOKEN="$(oc whoami -t)"
export EVALHUB_TENANT="eval-test"

Optionally, for clusters with self-signed certificates:

export EVALHUB_INSECURE=true
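
One plausible reading of this flag is that it turns off TLS certificate verification. The sketch below shows how a client could honour it with the standard library; it assumes that behaviour and assumes which values count as true (only `true` is shown above), and `ssl_context_from_env` is a hypothetical name, not the skill's actual code.

```python
import ssl

# Assumption: values treated as "on". Only "true" is documented.
TRUTHY = {"1", "true", "yes"}


def ssl_context_from_env(env):
    """Return a verifying context unless EVALHUB_INSECURE is truthy."""
    ctx = ssl.create_default_context()
    if env.get("EVALHUB_INSECURE", "").strip().lower() in TRUTHY:
        # check_hostname must be disabled before verify_mode is relaxed.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx
```

Skipping verification exposes the token to anyone who can intercept the connection, so the flag is best limited to test clusters.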

Usage

Once installed, Claude Code discovers the skill automatically. Ask in plain language in any session.

Example: discover what providers are available

You: What evaluation providers are available?

The skill fetches live metadata from the API and summarises each provider:

Available providers:

- AgentDojo — Test agent resilience to prompt injection across realistic tool-use
  scenarios. Targets: agent. Evaluates: agent-safety, prompt-injection,
  tool-use-security.

- LM Evaluation Harness — Comprehensive evaluation framework with 180 benchmarks
  covering math, science, coding, safety, reasoning, and more.

- Garak — LLM vulnerability scanner and red-teaming framework.

- GuideLLM — Performance benchmarking for LLM inference servers.

- SWE-bench — Software engineering benchmark using real-world GitHub issues.

- Lighteval — Lightweight LLM evaluation from Hugging Face.

Example: find providers for a specific use case

You: Which providers can evaluate my agent for prompt injection?

[
  {
    "id": "agentdojo",
    "name": "AgentDojo",
    "summary": "Test agent resilience to prompt injection across realistic tool-use scenarios",
    "target_type": "agent",
    "evaluates": ["agent-safety", "prompt-injection", "tool-use-security"]
  }
]

Claude: AgentDojo is the right fit here. It runs your agent through realistic tool-use scenarios (workspace, Slack, banking, travel) and measures both utility (task completion) and security (injection resistance). The model must support tool/function calling via an OpenAI-compatible API.
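
The matching step in this example can be sketched as a filter over the provider metadata. This assumes every provider record has the shape shown above; `providers_for` is a hypothetical helper, and the second record is invented sample data included only to give the filter something to exclude.

```python
import json

# First record copied from the example above; the second is made up.
PROVIDERS = json.loads("""
[
  {
    "id": "agentdojo",
    "name": "AgentDojo",
    "summary": "Test agent resilience to prompt injection across realistic tool-use scenarios",
    "target_type": "agent",
    "evaluates": ["agent-safety", "prompt-injection", "tool-use-security"]
  },
  {
    "id": "example-provider",
    "name": "Example Provider",
    "summary": "Hypothetical record for illustration",
    "target_type": "model",
    "evaluates": ["example-tag"]
  }
]
""")


def providers_for(providers, tag, target_type=None):
    """Return providers whose 'evaluates' list contains tag.

    If target_type is given, also require a matching 'target_type'.
    """
    return [
        p for p in providers
        if tag in p.get("evaluates", [])
        and (target_type is None or p.get("target_type") == target_type)
    ]


matches = providers_for(PROVIDERS, "prompt-injection", target_type="agent")
print([p["name"] for p in matches])
```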
