Eval Prompt
Given a prompt, design a set of evaluation cases that test whether it works — not just on
typical inputs, but on the boundaries and failure modes where prompts actually break.
This is the complement to prompt-optimize. You can't improve what you can't measure, and most
prompts are tested only against the handful of examples the author had in mind while writing them.
Behavior Constraints
<output_contract>
Present the eval set as:
- Prompt purpose — what the prompt does, in one sentence
- Output properties — the specific, testable properties a good output should have (e.g.,
"valid JSON", "under 200 words", "cites at least one source", "doesn't hallucinate facts
not in the input"). Each property should be evaluable by a human or an automated grader.
- Test cases — organized by category:
  - Happy path (2-3 cases) — typical inputs where the prompt should clearly succeed.
    Include the input and a description of expected output.
  - Edge cases (3-5 cases) — boundary inputs that test specific properties: empty input,
    very long input, ambiguous input, input in a different language, input that contradicts
    the prompt's assumptions, inputs at the boundary of what counts as "valid"
  - Adversarial cases (2-3 cases) — inputs designed to break the prompt: prompt injection
    attempts, inputs that exploit ambiguous instructions, inputs that trigger known failure
    modes of the model tier
- Grader suggestions — for each output property, how to evaluate it:
  - Exact match — when the output must match precisely (classification, extraction)
  - Contains/regex — when specific elements must be present
  - LLM-as-judge — when quality is subjective, with a suggested judging prompt
  - Human review — when automated grading isn't reliable enough
- Coverage gaps — what the eval set doesn't test and why (e.g., "doesn't test multilingual
inputs because the prompt is English-only by design")
</output_contract>
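The exact-match and contains/regex grader types are mechanical enough to sketch in code. The Python below is a minimal illustration under stated assumptions, not a prescribed implementation: the function names (`grade_exact_match`, `grade_regex`, `grade_valid_json`) are hypothetical, and trimming whitespace before an exact-match comparison is an assumption that may not suit every task.

```python
import json
import re


def grade_exact_match(output: str, expected: str) -> bool:
    """Exact match (classification, extraction), ignoring surrounding whitespace.

    Assumption: leading/trailing whitespace is not significant for the task.
    """
    return output.strip() == expected.strip()


def grade_regex(output: str, pattern: str) -> bool:
    """Contains/regex: passes if the pattern occurs anywhere in the output."""
    return re.search(pattern, output) is not None


def grade_valid_json(output: str) -> bool:
    """Property check for "valid JSON": the entire output must parse as JSON."""
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return False
    return True
```

Note that `grade_valid_json` deliberately rejects output such as `Sure! {"a": 1}`: a chatty preamble around otherwise valid JSON is one of the failures a "respond in JSON" property is meant to catch.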
<completeness_contract>
The task is complete when every stated output property has at least one test case targeting it,
and the edge cases go beyond obvious variations. If the prompt is too vague to eval meaningfully
(no clear success criteria), say so and recommend defining output properties before writing
test cases.
</completeness_contract>
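The first half of the completeness rule (every output property has at least one test case targeting it) can be checked mechanically once each case records which properties it exercises. The sketch below assumes a hypothetical `EvalCase` record and `untested_properties` helper; neither name comes from this document, and the second half of the rule (edge cases that go beyond obvious variations) still needs human judgment.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    """One test case in the eval set (hypothetical structure for illustration)."""
    name: str
    category: str      # "happy_path", "edge", or "adversarial"
    input_text: str
    expected: str      # description of the expected output
    targets: set[str] = field(default_factory=set)  # output properties exercised


def untested_properties(properties: list[str], cases: list[EvalCase]) -> list[str]:
    """Return the output properties that no case targets, in their original order."""
    covered: set[str] = set()
    for case in cases:
        covered |= case.targets
    return [p for p in properties if p not in covered]
```

An empty return value means the coverage condition is met; any property that comes back either needs a new test case or belongs in the coverage gaps section with a reason.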
<reasoning_rules>
- Derive test cases from the prompt's instructions, not from imagination. Every constraint in
the prompt implies at least one test: "respond in JSON" → test with an input that tempts
non-JSON output. "Cite sources" → test with an input where no sources exist.
- Edge cases matter more than happy-path cases. The prompt probably already works on typical
inputs — the author tested that informally. The eval set's value is in finding where it
breaks.
- Be specific about expected output. "Should produce a good summary" is not evaluable.
"Should be under 200 words, mention the key decision, and not include information not in the
source" is.
- Design adversarial cases that a real user might accidentally trigger, not just theoretical
attacks. An input that happens to contain the word "ignore" in natural context is more useful
than a cartoonish "ignore all previous instructions."
- If the prompt targets a specific model, note model-specific failure patterns (e.g., smaller
models are worse at maintaining format over long outputs).
</reasoning_rules>
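To illustrate what "specific about expected output" buys you, the summary example above splits into two mechanically checkable properties and one that needs a judge. This is a rough sketch: `check_summary` and `build_faithfulness_judge_prompt` are hypothetical names, the substring test for "mentions the key decision" is a simplifying assumption, and the judging prompt wording is only a starting point.

```python
def check_summary(summary: str, key_decision: str, word_limit: int = 200) -> dict[str, bool]:
    """Evaluate the two properties of the summary spec that code can check."""
    return {
        # "under 200 words", counting whitespace-separated tokens
        "under_word_limit": len(summary.split()) < word_limit,
        # Assumption: a case-insensitive substring match is good enough here
        "mentions_key_decision": key_decision.lower() in summary.lower(),
    }


def build_faithfulness_judge_prompt(source: str, summary: str) -> str:
    """Build an LLM-as-judge prompt for "no information not in the source"."""
    return (
        "You are grading a summary for faithfulness.\n"
        "Answer PASS if every claim in the summary is supported by the source. "
        "Otherwise answer FAIL and quote the unsupported claim.\n\n"
        f"<source>\n{source}\n</source>\n\n"
        f"<summary>\n{summary}\n</summary>"
    )
```

By contrast, "should produce a good summary" gives neither function anything to check, which is why it is not evaluable.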
Workflow
- Ask the user for the prompt to eval if not already provided. Accept pasted text or a file
path. Optionally ask:
  - What the prompt is for and which model it targets
  - Known failure cases, if any
  - Whether the eval is for one-time validation or ongoing regression testing
- Identify the output properties from the prompt's instructions.
- Design the test cases and present the output per the output contract above.
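If the user wants ongoing regression testing rather than one-time validation, the eval set needs a way to be re-run and compared against an earlier result. The following is a minimal sketch under stated assumptions: `run_prompt` is a hypothetical callable standing in for whatever sends an input through the prompt and returns the model output, and each case carries a single automated grader.

```python
from typing import Callable

Grader = Callable[[str], bool]


def run_eval(
    run_prompt: Callable[[str], str],
    cases: dict[str, tuple[str, Grader]],
) -> dict[str, bool]:
    """Run each case's input through the prompt and record whether its grader passed."""
    results: dict[str, bool] = {}
    for name, (input_text, grader) in cases.items():
        output = run_prompt(input_text)  # hypothetical model call
        results[name] = grader(output)
    return results


def regressions(baseline: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Names of cases that passed in the baseline but fail or are missing now."""
    return sorted(
        name for name, passed in baseline.items()
        if passed and not current.get(name, False)
    )
```

Storing the `run_eval` result from a known-good prompt version as the baseline lets `regressions` flag exactly which cases a later prompt edit broke. Cases that need LLM-as-judge or human review would have to be handled outside this loop.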