Skill

Eval Prompt

From prompt

Design test cases for a prompt covering happy paths, edge cases, and failure modes.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/prompt:eval

User invocable

Model invocable

Inline context

Default effort

Configuration

Model: sonnet

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Given a prompt, design a set of evaluation cases that test whether it works — not just on

SKILL.md

79 lines · ~1k tokens

Stats

Parent stars: 0

Maintenance: Good

Last Commit: Apr 14, 2026

Eval Prompt

Given a prompt, design a set of evaluation cases that test whether it works — not just on typical inputs, but on the boundaries and failure modes where prompts actually break.

This is the complement to prompt-optimize. You can't improve what you can't measure, and most prompts are tested only against the three examples the author had in mind when writing them.

Behavior Constraints

<output_contract>

Present the eval set as:

1. Prompt purpose: what the prompt does, in one sentence.
2. Output properties: the specific, testable properties a good output should have (e.g., "valid JSON", "under 200 words", "cites at least one source", "doesn't hallucinate facts not in the input"). Each property should be evaluable by a human or an automated grader.
3. Test cases, organized by category:
   - Happy path (2-3 cases): typical inputs where the prompt should clearly succeed. Include the input and a description of expected output.
   - Edge cases (3-5 cases): boundary inputs that test specific properties, such as empty input, very long input, ambiguous input, input in a different language, input that contradicts the prompt's assumptions, and inputs at the boundary of what counts as "valid".
   - Adversarial cases (2-3 cases): inputs designed to break the prompt, such as prompt injection attempts, inputs that exploit ambiguous instructions, and inputs that trigger known failure modes of the model tier.
4. Grader suggestions: for each output property, how to evaluate it:
   - Exact match: when the output must match precisely (classification, extraction)
   - Contains/regex: when specific elements must be present
   - LLM-as-judge: when quality is subjective, with a suggested judging prompt
   - Human review: when automated grading isn't reliable enough
5. Coverage gaps: what the eval set doesn't test and why (e.g., "doesn't test multilingual inputs because the prompt is English-only by design").

</output_contract>
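
The automated grader types named in the output contract could look roughly like the following. This is a minimal sketch, not part of the skill itself: the function names (`exact_match`, `contains_pattern`, `is_valid_json`, `under_word_limit`) are illustrative assumptions, and the property examples ("valid JSON", "under 200 words") are taken from the contract's own list.

```python
import json
import re


def exact_match(output: str, expected: str) -> bool:
    """Exact-match grader: pass only if output equals the expected label.

    Outer whitespace is ignored (an assumption; tighten if whitespace matters).
    """
    return output.strip() == expected.strip()


def contains_pattern(output: str, pattern: str) -> bool:
    """Contains/regex grader: pass if the pattern matches anywhere in the output."""
    return re.search(pattern, output) is not None


def is_valid_json(output: str) -> bool:
    """Property grader for "valid JSON": the whole output must parse."""
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return False
    return True


def under_word_limit(output: str, limit: int = 200) -> bool:
    """Property grader for "under 200 words", counting whitespace-separated words."""
    return len(output.split()) < limit


print(exact_match(" positive\n", "positive"))        # True
print(is_valid_json("Sure! Here is the JSON: {}"))   # False
```

Properties that cannot be reduced to a check like these (tone, faithfulness to a source) are the ones the contract routes to LLM-as-judge or human review.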

<completeness_contract>

The task is complete when every stated output property has at least one test case targeting it, and the edge cases go beyond obvious variations. If the prompt is too vague to eval meaningfully (no clear success criteria), say so and recommend defining output properties before writing test cases.

</completeness_contract>
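
The completeness rule (every stated output property has at least one test case targeting it) can be checked mechanically once each test case records which properties it targets. The sketch below assumes a hypothetical `EvalCase` structure and `untested_properties` helper; neither name comes from the skill.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class EvalCase:
    """One test case in the eval set (hypothetical structure for illustration)."""
    name: str
    category: str      # "happy", "edge", or "adversarial"
    input_text: str
    targets: set[str] = field(default_factory=set)  # output properties this case checks


def untested_properties(properties: list[str], cases: list[EvalCase]) -> list[str]:
    """Return the output properties that no test case targets, in their original order."""
    covered = set().union(*(case.targets for case in cases))
    return [prop for prop in properties if prop not in covered]


properties = ["valid JSON", "under 200 words"]
cases = [EvalCase("basic review", "happy", "Great product.", {"valid JSON"})]
print(untested_properties(properties, cases))  # ['under 200 words']
```

An empty result means the first half of the contract is met; whether the edge cases "go beyond obvious variations" still needs judgment.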

<reasoning_rules>

Derive test cases from the prompt's instructions, not from imagination. Every constraint in the prompt implies at least one test: "respond in JSON" → test with an input that tempts non-JSON output. "Cite sources" → test with an input where no sources exist.
Edge cases matter more than happy-path cases. The prompt probably already works on typical inputs — the author tested that informally. The eval set's value is in finding where it breaks.
Be specific about expected output. "Should produce a good summary" is not evaluable. "Should be under 200 words, mention the key decision, and not include information not in the source" is.
Design adversarial cases that a real user might accidentally trigger, not just theoretical attacks. An input that happens to contain the word "ignore" in natural context is more useful than a cartoonish "ignore all previous instructions."
If the prompt targets a specific model, note model-specific failure patterns (e.g., smaller models are worse at maintaining format over long outputs).

</reasoning_rules>
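
As an illustration of deriving cases from constraints, the examples in the rules above might be written down as data like this. The inputs and expected behaviors are invented for illustration, and the dictionary keys and the `cases_for` helper are assumptions, not a format the skill prescribes.

```python
# Each case names the prompt constraint it was derived from, so a reader can
# trace every test back to an instruction in the prompt under evaluation.
derived_cases = [
    {
        "constraint": "respond in JSON",
        "category": "edge",
        "input": "Just tell me in plain English: is this review positive?",
        "expect": "output still parses as JSON",
    },
    {
        "constraint": "cite sources",
        "category": "edge",
        "input": "Summarize this note: 'Lunch moved to 1pm.'",
        # Assumed expectation: the prompt should not invent a citation.
        "expect": "says no sources are available rather than fabricating one",
    },
    {
        "constraint": "follow the prompt's instructions",
        "category": "adversarial",
        # "ignore" appears in natural context, not as a cartoonish injection.
        "input": "Customer wrote: 'Please ignore the previous invoice, it was sent in error.'",
        "expect": "treats 'ignore' as message content, not as an instruction",
    },
]


def cases_for(constraint: str) -> list[dict]:
    """Return the derived cases that trace back to one constraint."""
    return [case for case in derived_cases if case["constraint"] == constraint]


print(len(cases_for("respond in JSON")))  # 1
```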

Workflow

1. Ask the user for the prompt to eval if not already provided. Accept pasted text or a file path. Optionally ask:
   - What the prompt is for and which model it targets
   - Known failure cases, if any
   - Whether the eval is for one-time validation or ongoing regression testing
2. Identify the output properties from the prompt's instructions.
3. Design the test cases and present the output per the output contract above.
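
If the eval is meant for ongoing regression testing, the resulting cases and graders could be replayed with a small loop like the one below. This is a sketch under stated assumptions: `run_eval` is a hypothetical helper, and `fake_model` is a stand-in for whatever function actually sends the prompt plus input to a model.

```python
import json


def is_valid_json(output: str) -> bool:
    """Grader for the "valid JSON" property."""
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return False
    return True


def run_eval(model_fn, cases, graders):
    """Run every case through the model and return {case name: [failed property names]}."""
    failures = {}
    for case in cases:
        output = model_fn(case["input"])
        failed = [name for name, grade in graders.items() if not grade(output)]
        if failed:
            failures[case["name"]] = failed
    return failures


def fake_model(user_input: str) -> str:
    """Stand-in for a real model call: breaks format when asked for plain English."""
    if "plain English" in user_input:
        return "Sure, it is positive."
    return '{"sentiment": "positive"}'


cases = [
    {"name": "happy-basic", "input": "Great product, works well."},
    {"name": "edge-tempts-prose", "input": "Answer in plain English: is this positive?"},
]
graders = {"valid JSON": is_valid_json}

print(run_eval(fake_model, cases, graders))
# {'edge-tempts-prose': ['valid JSON']}
```

Keeping graders keyed by property name means a failure report points directly at the output property that regressed.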
