Evaluation framework skills for designing scoring rubrics, running structured evaluations on LLM outputs, and comparing candidate outputs to recommend a winner.
Compare two LLM outputs on the same evaluation criteria and recommend a winner with justification. Use this skill when asked to "compare these outputs", "which response is better", "A/B eval", or "pick the best candidate".
Design evaluation criteria and a 1-5 scoring rubric for a task or LLM output. Use this skill when asked to "create an eval", "define evaluation criteria", "build a scoring rubric", or "design how to measure quality" for any output.
Execute a structured evaluation against a set of LLM outputs and produce a scored report. Use this skill when asked to "run the eval", "score these outputs", "evaluate this response", or "generate an evaluation report".
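As a rough illustration of how a 1-5 rubric and a two-candidate comparison fit together, the sketch below scores two outputs against weighted criteria and names a winner. The `Criterion` class, the weights, and the criterion names are hypothetical and are not taken from the plugin; the skills themselves may structure rubrics differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float  # hypothetical relative importance, not defined by the plugin


def weighted_score(criteria: list[Criterion], scores: dict[str, int]) -> float:
    """Return the weighted mean of 1-5 scores across all criteria."""
    for c in criteria:
        s = scores[c.name]
        if not 1 <= s <= 5:
            raise ValueError(f"{c.name}: score {s} is outside the 1-5 range")
    total_weight = sum(c.weight for c in criteria)
    return sum(c.weight * scores[c.name] for c in criteria) / total_weight


def pick_winner(
    criteria: list[Criterion],
    scores_a: dict[str, int],
    scores_b: dict[str, int],
) -> str:
    """Compare two candidates on the same criteria and return 'A', 'B', or 'tie'."""
    a = weighted_score(criteria, scores_a)
    b = weighted_score(criteria, scores_b)
    if a == b:
        return "tie"
    return "A" if a > b else "B"


# Illustrative criteria and scores only.
criteria = [Criterion("accuracy", 2.0), Criterion("clarity", 1.0)]
scores_a = {"accuracy": 4, "clarity": 3}
scores_b = {"accuracy": 3, "clarity": 4}
print(pick_winner(criteria, scores_a, scores_b))
```

Scoring both candidates against the same criteria list keeps the comparison like-for-like, and the per-criterion scores double as the justification for the recommendation.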
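A plugin library and development framework for Claude Code. This repo serves two purposes: a marketplace of plugins that you install with /plugin install, and a cookbook of baseline configs that you copy into your projects.
# Add the marketplace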
/plugin marketplace add ats-kinoshita-iso/agent-workshop
# Browse and install plugins
/plugin install planning@agent-workshop
Browse cookbook/ and copy what you need into your project:
cookbook/claude-md/ - CLAUDE.md templates by project type
cookbook/hooks/ - Reusable hook recipes for .claude/settings.json
cookbook/mcp/ - MCP server configurations for common integrations

.claude-plugin/    # Marketplace definition (marketplace.json)
plugins/           # Stable, packaged Claude Code plugins
cookbook/          # Golden baseline configs (copy into your projects)
  claude-md/       # CLAUDE.md templates
  hooks/           # Hook recipes
  mcp/             # MCP server configs
tools/             # Development & validation tooling
tests/             # Plugin validation gates
1. Develop in .claude/ → Iterate locally with Claude Code
2. Validate with test suite → uv run pytest
3. Package as plugin → Create plugin.json + SKILL.md in plugins/
4. Auto-register in marketplace → marketplace_gen.py updates marketplace.json
5. Users install from here → /plugin marketplace add ats-kinoshita-iso/agent-workshop
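The auto-registration step can be pictured with a minimal sketch like the one below. It is not the repo's marketplace_gen.py, whose behavior is not shown here. The `build_catalog` function, the `{"plugins": [...]}` output shape, and the `source` field are all assumptions made for illustration; only the plugins/<your-plugin>/.claude-plugin/plugin.json layout comes from this README.

```python
import json
import tempfile
from pathlib import Path


def build_catalog(plugins_dir: Path) -> dict:
    """Collect every <plugin>/.claude-plugin/plugin.json into one catalog dict."""
    entries = []
    for manifest in sorted(plugins_dir.glob("*/.claude-plugin/plugin.json")):
        data = json.loads(manifest.read_text(encoding="utf-8"))
        entries.append(
            {
                "name": data["name"],
                "description": data.get("description", ""),
                "version": data.get("version", "0.0.0"),
                # Hypothetical field: the plugin's directory name under plugins/.
                "source": manifest.parent.parent.name,
            }
        )
    # Assumed output shape; the real marketplace.json schema may differ.
    return {"plugins": entries}


# Demo against a throwaway directory tree so nothing in a real repo is touched.
with tempfile.TemporaryDirectory() as tmp:
    plugins = Path(tmp) / "plugins"
    manifest_dir = plugins / "planning" / ".claude-plugin"
    manifest_dir.mkdir(parents=True)
    (manifest_dir / "plugin.json").write_text(
        json.dumps(
            {
                "name": "planning",
                "description": "Phased implementation plans with gates and tests",
                "version": "2.0.0",
            }
        ),
        encoding="utf-8",
    )
    catalog = build_catalog(plugins)
    print(json.dumps(catalog, indent=2))
```

Deriving the catalog from the per-plugin manifests keeps a single source of truth: a plugin's metadata lives next to the plugin, and the catalog is regenerated instead of being edited by hand.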
| Plugin | Description | Version |
|---|---|---|
| code-quality-gate | Unified quality orchestrator (lint, format, typecheck, test) | 1.0.0 |
| context-sync | Keeps CLAUDE.md files in sync with the codebase | 1.0.0 |
| plan-manager | Plan lifecycle management with gate tracking and archival | 1.0.0 |
| planning | Phased implementation plans with gates and tests | 2.0.0 |
| test-quality | Test generation, auditing, and knowledge extraction | 1.0.0 |
| workspace-clean | Workspace hygiene checks and cleanup | 1.0.0 |
| Plugin | Skills | License |
|---|---|---|
| anthropic-document-skills | docx, pdf, pptx, xlsx | Source-available |
| anthropic-creative-skills | algorithmic-art, brand-guidelines, canvas-design, frontend-design, slack-gif-creator, theme-factory | Apache 2.0 |
| anthropic-dev-skills | claude-api, mcp-builder, skill-creator, web-artifacts-builder, webapp-testing | Apache 2.0 |
| anthropic-enterprise-skills | doc-coauthoring, internal-comms | Apache 2.0 |
uv
bun

uv sync
uv run pytest
uv run ruff check .
uv run ruff format .
uv run mypy .
bun install
bun test
bunx biome check --write .

1. Develop in .claude/ (skills, hooks, agents, etc.)
2. Create plugins/<your-plugin>/.claude-plugin/plugin.json with name, description, version, keywords
3. Run uv run pytest to validate structure
4. Run uv run python tools/marketplace_gen.py to update the marketplace catalog
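A structure check on plugin.json might look like the sketch below. It assumes the four fields named above (name, description, version, keywords) are all mandatory and must be non-empty; the repo's actual pytest gates are not shown here and may check more or less. The `missing_fields` helper and the sample manifests are hypothetical.

```python
import json

# Assumed to be required, based on the fields listed in the contribution steps.
REQUIRED_FIELDS = ("name", "description", "version", "keywords")


def missing_fields(manifest_text: str) -> list[str]:
    """Return the required fields that are absent or empty in a plugin.json string."""
    data = json.loads(manifest_text)
    return [field for field in REQUIRED_FIELDS if not data.get(field)]


# Illustrative manifests only.
complete = json.dumps(
    {
        "name": "my-plugin",
        "description": "Example plugin",
        "version": "0.1.0",
        "keywords": ["example"],
    }
)
incomplete = json.dumps({"name": "my-plugin", "version": "0.1.0"})

print(missing_fields(complete))    # []
print(missing_fields(incomplete))  # ['description', 'keywords']
```

Returning the list of missing fields, instead of a bare pass/fail, lets a test report exactly what a contributor still needs to add.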
npx claudepluginhub ats-kinoshita-iso/agent-workshop --plugin eval-framework
trine-eval: Planner-Generator-Evaluator harness for eval-driven development across web/RAG/CLI/API projects, eval-harness methodology audits (meta layer), and harness-build agent-runtime conformance (runtime layer)
Analyzes and rewrites prompts for Claude Code, applying structured prompt engineering patterns to produce clearer, more effective instructions.
Anthropic's official development skills for Claude API integration, MCP server building, skill creation, web artifact building, and browser-based testing.
Observability skills for designing logging and tracing strategies, instrumenting existing code with structured log points, and analyzing trace logs to diagnose production issues.
Skills for designing and evaluating multi-agent systems: orchestrator/worker decomposition, output quality review, and self-improving evaluator/optimizer loops.
Skill evaluation and benchmarking - test skill effectiveness with behavioral eval cases, grade results, and track quality improvements
Use this agent when evaluating new development tools, frameworks, or services for the studio. This agent specializes in rapid tool assessment, comparative analysis, and making recommendations that align with the 6-day development cycle philosophy.
LLM Judges plugin
Four-layer test framework for Claude Code plugin skills — structure validation, trigger accuracy, session testing, and skill value comparison
Create new skills, improve existing skills, and measure skill performance. Use when users want to create a skill from scratch, update or optimize an existing skill, run evals to test a skill, or benchmark skill performance with variance analysis.