From forge
Measure non-deterministic behavior — LLM features, agents, prompts, or a skill itself — with repeatable evals instead of one-shot checks. Use when building or tuning AI/LLM functionality (ranking, extraction, generation, agent loops), when a feature could pass once by luck, or when validating that a prompt or skill actually changes behavior.
Slash command
/forge:eval
```
███████╗██╗   ██╗ █████╗ ██╗
██╔════╝██║   ██║██╔══██╗██║
█████╗  ██║   ██║███████║██║
██╔══╝  ╚██╗ ██╔╝██╔══██║██║
███████╗ ╚████╔╝ ██║  ██║███████╗
╚══════╝  ╚═══╝  ╚═╝  ╚═╝╚══════╝
```
Deterministic code gets forge:verify — run it once, read the output, done. Non-deterministic behavior (anything LLM- or agent-driven) needs evals, because a single green run can be luck. Evals are the unit tests of AI work: a repeatable input set + expected behavior + a grader, run enough times that the result isn't luck (see "how many runs" below).
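The three ingredients named above (input set, expected behavior, grader) plus repeated trials can be sketched in a few lines of Python. This is only an illustration: `EvalCase`, `exact_match`, `run_eval`, and the `flaky_feature` stand-in are hypothetical names, not part of forge, and a real system under test would call a model instead of a seeded random generator.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One repeatable input paired with its expected behavior (hypothetical shape)."""
    name: str
    prompt: str
    expected: str


def exact_match(output: str, case: EvalCase) -> bool:
    """Simplest possible grader; real graders may be rubric- or model-based."""
    return output.strip().lower() == case.expected.strip().lower()


def run_eval(
    system: Callable[[str], str],
    cases: list[EvalCase],
    grader: Callable[[str, EvalCase], bool],
    k: int,
) -> dict[str, list[bool]]:
    """Run every case k times and keep each trial's verdict, not just the last one."""
    if k < 2:
        raise ValueError("k=1 cannot separate a real pass from a lucky one")
    return {c.name: [grader(system(c.prompt), c) for _ in range(k)] for c in cases}


# Stand-in for an LLM call: answers correctly about 80% of the time (assumption).
rng = random.Random(0)


def flaky_feature(prompt: str) -> str:
    return "paris" if rng.random() < 0.8 else "lyon"


results = run_eval(
    flaky_feature,
    [EvalCase("capital", "Capital of France?", "Paris")],
    exact_match,
    k=10,
)
for name, trials in results.items():
    print(f"{name}: {sum(trials)}/{len(trials)} passed")
```

Keeping the full list of per-trial verdicts, rather than a single boolean, is what lets later steps compute pass rates over the k runs.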
"Run it a few times" is not a number. With 0 failures in n runs you can only claim, at 95% confidence, that the failure rate is below about 3/n (the "rule of three"). So pass^k = 1.0 at k=3 proves almost nothing: three clean runs are still consistent with a failure rate as high as roughly 60%. A release-critical path needs about 20 clean runs just to claim a failure rate under roughly 15%. Defaults: capability k ≥ 10; release-path regression k ≥ 20; never k=1.
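The arithmetic behind the rule of three is short enough to check directly. If the true failure rate is p, the chance of n clean runs is (1 - p)^n; setting that equal to 0.05 and solving for p gives the exact bound, and 3/n is its approximation because -ln(0.05) is about 3. The function names below are illustrative, not part of forge.

```python
def failure_rate_upper_bound(n_clean_runs: int, confidence: float = 0.95) -> float:
    """Exact upper bound p on the failure rate after n runs with zero failures.

    Solves (1 - p)**n = 1 - confidence for p.
    """
    if n_clean_runs < 1:
        raise ValueError("need at least one run")
    return 1.0 - (1.0 - confidence) ** (1.0 / n_clean_runs)


def rule_of_three(n_clean_runs: int) -> float:
    """The 3/n shortcut, capped at 1.0 because it is a probability."""
    if n_clean_runs < 1:
        raise ValueError("need at least one run")
    return min(1.0, 3.0 / n_clean_runs)


for k in (1, 3, 10, 20):
    exact = failure_rate_upper_bound(k)
    approx = rule_of_three(k)
    print(f"k={k:>2}  exact bound={exact:.2f}  3/n={approx:.2f}")
```

At k=3 the exact bound is about 0.63 and at k=20 it is about 0.14, which is where the "proves almost nothing" and "under roughly 15%" figures come from.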
Build the eval set from real inputs paired with expected behavior, covering the common case, the edge cases, and every past failure (regression). Keep a held-out slice you never tune against.
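One way to keep such a set honest is to tag each case by kind and assign the held-out slice by hashing the case id, so a case never drifts between slices as the set grows. The sketch below assumes a hypothetical `Case` record and made-up example cases; none of these names or inputs come from forge.

```python
import hashlib
from dataclasses import dataclass

REQUIRED_KINDS = {"common", "edge", "regression"}


@dataclass(frozen=True)
class Case:
    case_id: str
    kind: str  # one of REQUIRED_KINDS
    input_text: str
    expected: str


def is_held_out(case_id: str, percent: int = 20) -> bool:
    """Stable split: hash the id so a case never migrates between slices."""
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    return digest[0] * 100 // 256 < percent


# Made-up cases for illustration only.
CASES = [
    Case("rank-001", "common", "Rank three resumes for a backend role", "backend-heavy resume first"),
    Case("extract-007", "edge", "Extract dates from an empty document", "empty list"),
    Case("tailor-013", "regression", "Tailor a summary without inventing employers", "no new employers"),
]

tune_set = [c for c in CASES if not is_held_out(c.case_id)]
held_out = [c for c in CASES if is_held_out(c.case_id)]
missing_kinds = REQUIRED_KINDS - {c.kind for c in CASES}

print(f"tune={len(tune_set)} held_out={len(held_out)} missing_kinds={sorted(missing_kinds)}")
```

With only a handful of cases the hashed slice can easily come out empty, so a small set may need its held-out cases picked by hand instead.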
Same shape as forge:tdd's watch-it-fail, applied to instructions. First baseline a fresh agent WITHOUT the skill/prompt: does it fail or behave wrong? Then add the skill/prompt and confirm it now passes. Run that with/without comparison k times and compare pass rates; a single before/after is the same luck this skill warns against. If it passes either way, the skill isn't earning its place. (This is how Anthropic's skill-creator validates skills.)
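The with/without comparison can be sketched as two arms run k times each, reporting pass rates rather than a single verdict. In this sketch each arm is a zero-argument callable returning pass or fail. The arms are simulated with a seeded random generator purely for illustration; a real trial would launch a fresh agent and grade its output, and `pass_rate` and `compare` are hypothetical helper names.

```python
import random
from typing import Callable


def pass_rate(trial: Callable[[], bool], k: int) -> float:
    """Fraction of k independent trials that pass."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return sum(trial() for _ in range(k)) / k


def compare(
    baseline: Callable[[], bool],
    with_skill: Callable[[], bool],
    k: int = 10,
) -> dict[str, float]:
    """Run both arms k times each and report pass rates plus the lift."""
    base = pass_rate(baseline, k)
    treated = pass_rate(with_skill, k)
    return {"baseline": base, "with_skill": treated, "lift": treated - base}


# Simulated arms (assumption): success probabilities are invented for the demo.
rng = random.Random(7)
report = compare(
    baseline=lambda: rng.random() < 0.3,    # agent without the skill
    with_skill=lambda: rng.random() < 0.9,  # same agent with the skill loaded
    k=20,
)
print(report)
```

A lift near zero is the "passes either way" case: the skill is not changing behavior. With small k the two rates are noisy, so a modest lift should be read with the same caution as a small number of clean runs.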
Once capability evals reach their target pass@k and regression evals hold at pass^k = 1.0, hand off to forge:verify / forge:ship. For an AI app like this one, the LLM ranking, extraction, and tailoring paths are exactly what to put behind evals.
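The text does not define pass@k and pass^k, so the sketch below assumes the common reading: pass@k means at least one of k trials succeeds, and pass^k means all k trials succeed. The gate function and its name are hypothetical, and the trial lists are kept unrealistically short for readability.

```python
def pass_at_k(trials: list[bool]) -> bool:
    """pass@k for one case: at least one of the k trials succeeded."""
    return any(trials)


def pass_hat_k(trials: list[bool]) -> bool:
    """pass^k for one case: every one of the k trials succeeded."""
    return bool(trials) and all(trials)


def ready_to_ship(
    capability: dict[str, list[bool]],
    regression: dict[str, list[bool]],
    target: float,
) -> bool:
    """Gate: capability cases meet the pass@k target, regression cases never fail."""
    if not capability or not regression:
        raise ValueError("both suites need at least one case")
    cap_score = sum(pass_at_k(t) for t in capability.values()) / len(capability)
    reg_score = sum(pass_hat_k(t) for t in regression.values()) / len(regression)
    return cap_score >= target and reg_score == 1.0


# Illustrative trial results only; real runs would use the k defaults above.
capability = {"rank": [True, False, True], "extract": [False, False, False]}
regression = {"past-bug": [True, True, True]}

print(ready_to_ship(capability, regression, target=0.5))
print(ready_to_ship(capability, regression, target=0.8))
```

The asymmetry is deliberate: capability suites tolerate some misses against a target, while a single failed trial in any regression case blocks the hand-off.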
npx claudepluginhub vasu-devs/forge --plugin forge