web-bench

Name: web-bench
Author: lespaceman

WebBench benchmark runner — executes real-world browser tasks from the Halluminate/WebBench dataset, scores via LLM-as-judge, and produces evaluation reports

npx claudepluginhub lespaceman/athena-workflow-marketplace --plugin web-bench

What's Inside

Skills (5)

evaluate-task

/evaluate-task

Evaluate whether a WebBench task was successfully completed using LLM-as-judge scoring. Triggers: "evaluate task", "score task", "judge result", "grade benchmark task". Examines the execution trace, final page state, and extracted data against the original task description. Produces a structured verdict (PASS/PARTIAL/FAIL) with reasoning. Does NOT execute browser actions — use execute-task for that.
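
The structured verdict described above could be represented and validated along these lines. This is a minimal sketch, not the plugin's actual schema: the JSON shape and the `task_id` and `reasoning` field names are assumptions made for illustration; only the three verdict values come from the skill description.

```python
import json
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    # The three outcomes named in the skill description.
    PASS = "PASS"
    PARTIAL = "PARTIAL"
    FAIL = "FAIL"


@dataclass(frozen=True)
class Judgement:
    task_id: str      # hypothetical field name
    verdict: Verdict
    reasoning: str    # hypothetical field name


def parse_judgement(raw: str) -> Judgement:
    """Parse one judge response (assumed to be a JSON object) and validate it."""
    data = json.loads(raw)
    # Verdict(...) raises ValueError for anything other than PASS/PARTIAL/FAIL.
    verdict = Verdict(str(data["verdict"]).upper())
    reasoning = str(data.get("reasoning", "")).strip()
    if not reasoning:
        raise ValueError("a verdict must come with reasoning")
    return Judgement(task_id=str(data["task_id"]), verdict=verdict, reasoning=reasoning)
```

Rejecting unknown verdict strings at parse time keeps malformed judge output from silently counting as a failure or a pass later in the pipeline.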

execute-task

/execute-task

Methodology for executing a single WebBench benchmark task via browser automation. Triggers: "execute task", "run task", "perform benchmark task", "browser task". Interprets the natural-language task description, defines the required browser actions, and specifies what final evidence to capture (for example screenshot + snapshot). Records an execution trace with actions taken and errors encountered. Does NOT evaluate success — use evaluate-task for that.

generate-report

/generate-report

Aggregate WebBench benchmark results into a comprehensive evaluation report. Triggers: "generate report", "create benchmark report", "summarize results", "aggregate scores", "produce evaluation report". Reads web-bench-results.jsonl, computes statistics by category/website/failure mode, and writes web-bench-report.md with pass rates, timing, token usage, and analysis. Does NOT execute or evaluate tasks — only aggregates existing results.
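
The per-category aggregation mentioned above might look roughly like the following. It is a sketch under stated assumptions, not the skill's implementation: the listing does not document the record layout of web-bench-results.jsonl, so the `category` and `verdict` keys are hypothetical.

```python
import json
from collections import defaultdict


def pass_rates_by_category(jsonl_text: str) -> dict[str, float]:
    """Return the fraction of PASS verdicts per category.

    Assumes each non-empty line is a JSON object with hypothetical
    "category" and "verdict" keys.
    """
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        category = record["category"]
        totals[category] += 1
        if record["verdict"] == "PASS":
            passes[category] += 1
    return {category: passes[category] / totals[category] for category in totals}
```

In this sketch a PARTIAL verdict counts as not passed; a real report may well track it separately.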

load-dataset

/load-dataset

Download and prepare the Halluminate/WebBench dataset from HuggingFace for benchmarking. Triggers: "load dataset", "download WebBench", "prepare benchmark data", "fetch tasks". Downloads the CSV dataset via curl, converts to JSONL with Node.js, applies optional filters (category, sample size, website allowlist/blocklist), and writes web-bench-tasks.jsonl to the working directory. Zero Python dependencies — uses only curl and Node.js. Does NOT execute tasks — use execute-task for that.

run-benchmark

/run-benchmark

Run the WebBench browser agent benchmark — main entry point and orchestrator. Triggers: "run benchmark", "run WebBench", "start benchmark", "benchmark browser agent", "web bench", "execute WebBench", "run web-bench". Parses user configuration (category filter, sample size, resume), delegates to load-dataset, execute-task, evaluate-task, and generate-report skills. This is the user-invocable orchestrator that ties the full benchmark pipeline together.
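
To illustrate how the category filter, sample size, and resume options could interact, here is a small sketch of selecting the tasks a resumed run still has to execute. The `id` and `category` field names and both helper functions are hypothetical; the plugin's actual selection logic is not shown in this listing.

```python
import json
from typing import Optional


def load_jsonl(text: str) -> list[dict]:
    """Parse JSONL text into a list of records, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def pending_tasks(
    tasks: list[dict],
    results: list[dict],
    category: Optional[str] = None,
    sample_size: Optional[int] = None,
) -> list[dict]:
    """Select the tasks a resumed run still has to execute.

    "id" and "category" are hypothetical field names.
    """
    selected = [t for t in tasks if category is None or t["category"] == category]
    if sample_size is not None:
        # Take the first N rather than a random sample so the sketch is deterministic.
        selected = selected[:sample_size]
    # Resume: drop every task that already has a recorded result.
    done = {r["id"] for r in results}
    return [t for t in selected if t["id"] not in done]
```

The sample is fixed before completed tasks are removed, so resuming a run of N tasks finishes the same N rather than drawing N new ones.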

MCP Servers (1)

agent-web-interface

Stats

Version: 1.0.5
Language: Python
Stars: 2
Maintenance: Excellent
Last Commit: Apr 2, 2026
Added: Mar 28, 2026

Available In

athena-workflow-marketplace (3)

README

Athena Plugin Marketplace

Marketplace repository for:

Claude Code plugins (plugins/)
Athena workflows (workflows/)

Spec

Workflow and manifest contracts are defined in:

Use this RFC as the source of truth for workflow behavior, lifecycle, and cross-runtime plugin compatibility (skills, tools, and sub-agent/task patterns).

Repository Structure

.
├── .claude-plugin/
│   └── marketplace.json            # Plugin catalog
├── .athena-workflow/
│   └── marketplace.json            # Workflow catalog
├── workflows/
│   └── e2e-test-builder/
│       ├── workflow.json
│       └── system_prompt.md
└── plugins/
    ├── e2e-test-builder/
    ├── md-export/
    └── site-knowledge/

Runtime Artifacts

Each plugin package now generates versioned runtime artifacts during npm pack / npm publish:

dist/<version>/release.json
dist/<version>/claude/plugin/
dist/<version>/codex/plugin/
dist/<version>/codex/marketplace.json

The generated runtime plugin directories are packaged artifacts, not source mirrors:

They keep the runtime-specific overlay for the target runtime plus the shared skill source.
They exclude transient local install state such as node_modules/ and lockfiles.
They do not retain repo-only build hooks that depend on this repository layout.

Build them locally for one plugin with:

cd plugins/<plugin-name>
npm run build:artifacts
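
After a local build, the layout listed above could be checked with a short script such as the one below. This is an illustrative sketch rather than a tool shipped in the repository; the function name `missing_artifacts` is hypothetical, and the relative paths are copied from the artifact list in this section.

```python
from pathlib import Path

# Relative paths taken from the artifact layout listed in this section.
EXPECTED_ARTIFACTS = (
    "release.json",
    "claude/plugin",
    "codex/plugin",
    "codex/marketplace.json",
)


def missing_artifacts(plugin_dir: Path, version: str) -> list[str]:
    """Return the expected artifact paths that do not exist under dist/<version>/."""
    root = plugin_dir / "dist" / version
    return [rel for rel in EXPECTED_ARTIFACTS if not (root / rel).exists()]
```

Calling `missing_artifacts(Path("plugins/<plugin-name>"), "<version>")` with a real plugin directory and version returns an empty list when all four artifacts are present.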

Skill Compatibility And Validation

This repo uses a split skill metadata model so the same skills stay compatible with both Claude and Codex:

Portable skill core lives in SKILL.md
Claude-specific invocation metadata lives in agents/claude.yaml
OpenAI/Codex UI metadata lives in agents/openai.yaml

For the full conventions, see docs/skills-compatibility.md.

Official Vendor Config vs Repo Overlays

Official vendor-aligned configuration lives outside this packaging layer:

Claude Code officially documents project config in .claude/settings.json and .claude/settings.local.json
Codex officially documents MCP/client configuration via codex mcp add ... and ~/.codex/config.toml

This repo also defines its own packaging overlays for distributing skills and plugins across runtimes:

.claude-plugin/plugin.json
.codex-plugin/plugin.json
.agents/plugins/marketplace.json
agents/openai.yaml
agents/claude.yaml

These repo overlay files are conventions used by this repository. They are not presented here as official vendor-standard file formats.

Local Environment

The repo includes a local Python 3.12 virtualenv at .venv for running the official Agent Skills validator.

Activate it with:

source .venv/bin/activate

Validation Commands

Run the official portable validator across all plugin skills:

scripts/validate-skills-portable.sh

Run the repo-specific compatibility checks:

scripts/validate-skills-repo.sh

Run the lightweight local validator on a single skill:

scripts/quick-validate-skill.sh plugins/e2e-test-builder/skills/write-test-code

Authoring Commands

Create a new repo-compatible skill scaffold:

scripts/init-compatible-skill.py my-skill --path plugins/my-plugin/skills --interface display_name="My Skill" --interface short_description="Describe the skill in the UI" --interface default_prompt="Run my skill." --argument-hint "<arg>"

Generate or update Claude-only overlay metadata for an existing skill:

scripts/generate-claude-yaml.py plugins/my-plugin/skills/my-skill --frontmatter user-invocable=true --frontmatter argument-hint="<arg>"

Metadata Placement

Use these rules when editing or adding skills:

Put only portable Agent Skills frontmatter in SKILL.md
Put Claude-only fields like argument-hint and user-invocable in agents/claude.yaml
Put menu copy like display_name, short_description, and default_prompt in agents/openai.yaml
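
The placement rules above can be expressed as a small lookup. The sketch below is not the repo's validator: it covers only the field names given as examples in the rules, and the function name `misplaced_fields` is hypothetical.

```python
# Field-to-file mapping taken from the placement rules above.
# Only the fields named as examples are covered; the real sets may be larger.
CLAUDE_ONLY = {"argument-hint", "user-invocable"}
OPENAI_UI = {"display_name", "short_description", "default_prompt"}


def misplaced_fields(skill_md_keys: list[str]) -> dict[str, str]:
    """Map each SKILL.md frontmatter key that belongs elsewhere to its proper file."""
    moves: dict[str, str] = {}
    for key in skill_md_keys:
        if key in CLAUDE_ONLY:
            moves[key] = "agents/claude.yaml"
        elif key in OPENAI_UI:
            moves[key] = "agents/openai.yaml"
    return moves
```

Passing the frontmatter keys found in a SKILL.md yields an empty mapping when only portable fields are present.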

Source Of Truth And Regeneration

Treat SKILL.md as the hand-authored source of truth for the skill itself
scripts/init-compatible-skill.py scaffolds SKILL.md, agents/openai.yaml, and agents/claude.yaml
scripts/generate-claude-yaml.py regenerates agents/claude.yaml for an existing skill
agents/claude.yaml should be treated as generated overlay metadata and may be overwritten when regenerated
agents/openai.yaml is created during scaffolding and may also be replaced if you rerun scaffold or metadata-generation flows for that skill
If you hand-edit generated metadata files, assume those edits can be lost on regeneration unless you also update the generator inputs or process

Install This Marketplace (Claude Plugin Consumers)
