By lespaceman
WebBench benchmark runner — executes real-world browser tasks from the Halluminate/WebBench dataset, scores via LLM-as-judge, and produces evaluation reports
Evaluate whether a WebBench task was successfully completed using LLM-as-judge scoring. Triggers: "evaluate task", "score task", "judge result", "grade benchmark task". Examines the execution trace, final page state, and extracted data against the original task description. Produces a structured verdict (PASS/PARTIAL/FAIL) with reasoning. Does NOT execute browser actions — use execute-task for that.
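As a rough illustration of the structured verdict described above, the sketch below normalizes a judge model's JSON reply into one of the three verdicts plus reasoning. The field names `verdict` and `reasoning` and the helper `parseVerdict` are assumptions made for this example; the plugin's actual output schema is not shown in this listing.

```javascript
// Hypothetical helper: the field names "verdict" and "reasoning" are assumed,
// not taken from the plugin's real schema.
const VERDICTS = ["PASS", "PARTIAL", "FAIL"];

function parseVerdict(judgeReply) {
  const parsed = JSON.parse(judgeReply);
  const verdict = String(parsed.verdict).toUpperCase();
  if (!VERDICTS.includes(verdict)) {
    throw new Error(`Unknown verdict: ${parsed.verdict}`);
  }
  if (typeof parsed.reasoning !== "string" || parsed.reasoning.trim() === "") {
    throw new Error("Verdict must include reasoning");
  }
  // Return only the two fields a report would need.
  return { verdict, reasoning: parsed.reasoning.trim() };
}

const example = parseVerdict(
  '{"verdict":"partial","reasoning":"Item found, but checkout was not reached."}'
);
console.log(example.verdict); // PARTIAL
```

Rejecting anything outside the three allowed values keeps a malformed judge reply from being silently counted as a pass or a fail.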
Methodology for executing a single WebBench benchmark task via browser automation. Triggers: "execute task", "run task", "perform benchmark task", "browser task". Interprets the natural-language task description, defines the required browser actions, and specifies what final evidence to capture (for example screenshot + snapshot). Records an execution trace with actions taken and errors encountered. Does NOT evaluate success — use evaluate-task for that.
Aggregate WebBench benchmark results into a comprehensive evaluation report. Triggers: "generate report", "create benchmark report", "summarize results", "aggregate scores", "produce evaluation report". Reads web-bench-results.jsonl, computes statistics by category/website/failure mode, and writes web-bench-report.md with pass rates, timing, token usage, and analysis. Does NOT execute or evaluate tasks — only aggregates existing results.
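The per-category aggregation described above could look roughly like the following. This is a minimal sketch that assumes each line of web-bench-results.jsonl is a JSON object with `category` and `verdict` fields; those field names and the function `passRatesByCategory` are hypothetical, and the real report also covers websites, failure modes, timing, and token usage, which are omitted here.

```javascript
// Assumed record shape (hypothetical): { "category": "...", "verdict": "PASS" | "PARTIAL" | "FAIL" }
function passRatesByCategory(jsonlText) {
  const stats = {};
  for (const line of jsonlText.split("\n")) {
    if (line.trim() === "") continue; // skip blank lines, e.g. a trailing newline
    const { category, verdict } = JSON.parse(line);
    if (!stats[category]) {
      stats[category] = { total: 0, pass: 0, partial: 0, fail: 0 };
    }
    const s = stats[category];
    s.total += 1;
    if (verdict === "PASS") s.pass += 1;
    else if (verdict === "PARTIAL") s.partial += 1;
    else s.fail += 1; // anything else is counted as a failure in this sketch
  }
  // Pass rate counts only full PASS verdicts.
  for (const s of Object.values(stats)) {
    s.passRate = s.pass / s.total;
  }
  return stats;
}

const sampleResults = [
  '{"category":"READ","verdict":"PASS"}',
  '{"category":"READ","verdict":"FAIL"}',
  '{"category":"CREATE","verdict":"PARTIAL"}',
].join("\n");
console.log(passRatesByCategory(sampleResults));
```

In practice the input would come from reading the results file, for example with `fs.readFileSync("web-bench-results.jsonl", "utf8")`.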
Download and prepare the Halluminate/WebBench dataset from HuggingFace for benchmarking. Triggers: "load dataset", "download WebBench", "prepare benchmark data", "fetch tasks". Downloads the CSV dataset via curl, converts to JSONL with Node.js, applies optional filters (category, sample size, website allowlist/blocklist), and writes web-bench-tasks.jsonl to the working directory. Zero Python dependencies — uses only curl and Node.js. Does NOT execute tasks — use execute-task for that.
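The CSV-to-JSONL conversion and filtering step might be sketched in plain Node.js as below, with no third-party packages. The column names (`id`, `category`, `task`), the option names, and the sample rows are assumptions for illustration only and may not match the real dataset; `sampleSize` here simply keeps the first N rows, which may differ from how the skill samples.

```javascript
// Minimal CSV parser: handles quoted fields, escaped quotes ("") and
// commas or newlines inside quotes. Returns an array of rows.
function parseCsv(text) {
  const rows = [];
  let row = [];
  let field = "";
  let inQuotes = false;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inQuotes) {
      if (ch === '"' && text[i + 1] === '"') { field += '"'; i++; }
      else if (ch === '"') inQuotes = false;
      else field += ch;
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ",") {
      row.push(field); field = "";
    } else if (ch === "\n") {
      row.push(field); rows.push(row); row = []; field = "";
    } else if (ch !== "\r") {
      field += ch;
    }
  }
  // Flush the last row when the input does not end with a newline.
  if (field !== "" || row.length > 0) { row.push(field); rows.push(row); }
  return rows;
}

// Hypothetical options: `category` (exact match) and `sampleSize` (first N rows).
function csvToJsonl(csvText, { category = null, sampleSize = null } = {}) {
  const [header, ...records] = parseCsv(csvText);
  let tasks = records.map((cells) =>
    Object.fromEntries(header.map((name, i) => [name, cells[i] ?? ""]))
  );
  if (category !== null) tasks = tasks.filter((t) => t.category === category);
  if (sampleSize !== null) tasks = tasks.slice(0, sampleSize);
  return tasks.map((t) => JSON.stringify(t)).join("\n");
}

const sampleCsv =
  'id,category,task\n1,READ,"Find the price, then report it"\n2,CREATE,Add an item\n';
console.log(csvToJsonl(sampleCsv, { category: "READ" }));
// {"id":"1","category":"READ","task":"Find the price, then report it"}
```

A hand-rolled parser keeps the step dependency-free, in line with the curl and Node.js constraint stated above.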
Run the WebBench browser agent benchmark — main entry point and orchestrator. Triggers: "run benchmark", "run WebBench", "start benchmark", "benchmark browser agent", "web bench", "execute WebBench", "run web-bench". Parses user configuration (category filter, sample size, resume), delegates to load-dataset, execute-task, evaluate-task, and generate-report skills. This is the user-invocable orchestrator that ties the full benchmark pipeline together.
Marketplace repository for:
plugins (plugins/)
workflows (workflows/)
Workflow and manifest contracts are defined in:
Use this RFC as the source of truth for workflow behavior, lifecycle, and cross-runtime plugin compatibility (skills, tools, and sub-agent/task patterns).
.
├── .claude-plugin/
│   └── marketplace.json    # Plugin catalog
├── .athena-workflow/
│   └── marketplace.json    # Workflow catalog
├── workflows/
│   └── e2e-test-builder/
│       ├── workflow.json
│       └── system_prompt.md
└── plugins/
    ├── e2e-test-builder/
    ├── md-export/
    └── site-knowledge/
Each plugin package now generates versioned runtime artifacts during npm pack / npm publish:
dist/<version>/release.json
dist/<version>/claude/plugin/
dist/<version>/codex/plugin/
dist/<version>/codex/marketplace.json
The generated runtime plugin directories are packaged artifacts, not source mirrors:
node_modules/ and lockfiles.
Build them locally for one plugin with:
cd plugins/<plugin-name>
npm run build:artifacts
This repo uses a split skill metadata model so the same skills stay compatible with both Claude and Codex:
SKILL.md
agents/claude.yaml
agents/openai.yaml
For the full conventions, see docs/skills-compatibility.md.
Official vendor-aligned configuration lives outside this packaging layer:
.claude/settings.json and .claude/settings.local.json
codex mcp add ... and ~/.codex/config.toml
This repo also defines its own packaging overlays for distributing skills and plugins across runtimes:
.claude-plugin/plugin.json
.codex-plugin/plugin.json
.agents/plugins/marketplace.json
agents/openai.yaml
agents/claude.yaml
These repo overlay files are conventions used by this repository. They are not presented here as official vendor-standard file formats.
The repo includes a local Python 3.12 virtualenv at .venv for running the official Agent Skills validator.
Activate it with:
source .venv/bin/activate
Run the official portable validator across all plugin skills:
scripts/validate-skills-portable.sh
Run the repo-specific compatibility checks:
scripts/validate-skills-repo.sh
Run the lightweight local validator on a single skill:
scripts/quick-validate-skill.sh plugins/e2e-test-builder/skills/write-test-code
Create a new repo-compatible skill scaffold:
scripts/init-compatible-skill.py my-skill --path plugins/my-plugin/skills --interface display_name="My Skill" --interface short_description="Describe the skill in the UI" --interface default_prompt="Run my skill." --argument-hint "<arg>"
Generate or update Claude-only overlay metadata for an existing skill:
scripts/generate-claude-yaml.py plugins/my-plugin/skills/my-skill --frontmatter user-invocable=true --frontmatter argument-hint="<arg>"
Use these rules when editing or adding skills:
SKILL.md
argument-hint and user-invocable in agents/claude.yaml
display_name, short_description, and default_prompt in agents/openai.yaml
SKILL.md as the hand-authored source of truth for the skill itself
scripts/init-compatible-skill.py scaffolds SKILL.md, agents/openai.yaml, and agents/claude.yaml
scripts/generate-claude-yaml.py regenerates agents/claude.yaml for an existing skill
agents/claude.yaml should be treated as generated overlay metadata and may be overwritten when regenerated
agents/openai.yaml is created during scaffolding and may also be replaced if you rerun scaffold or metadata-generation flows for that skill
npx claudepluginhub lespaceman/athena-workflow-marketplace --plugin web-bench
Claude Code plugin for browser automation
Export markdown files to clean, dark-themed PDFs
Site-specific automation patterns and knowledge for popular websites (Airbnb, Amazon, Apple Store)
Load testing and performance benchmarking with metrics analysis and bottleneck identification
API endpoint benchmarking and performance reporting
Run E2E browser tests using natural language test definitions powered by Claude Code SDK and agent-browser with video recording
Browser automation with Puppeteer CLI scripts. Use for screenshots, performance analysis, network monitoring, web scraping, form automation, or encountering JavaScript debugging, browser automation errors.
AI-powered browser automation -- lets Claude control real web browsers to navigate, click, type, extract content, and automate workflows
Browser automation with persistent page state. Use when users ask to navigate websites, fill forms, take screenshots, extract web data, test web apps, or automate browser workflows.