Skill

dev-prompt-evaluation

Designs, tests, compares, versions, and validates prompts or LLM behavior using measurable criteria and datasets. Useful when evaluating prompt quality, edge cases, and deployment readiness.

ai-ml

Popularity

Parent stars

Parent forks

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/codex-next:dev-prompt-evaluation

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Use this workflow when prompts, agent instructions, LLM outputs, or AI app behavior need systematic evaluation.

SKILL.md

60 lines · ~541 tokens

Stats

LanguagePython

Parent stars20

Parent forks1

MaintenanceExcellent

Last CommitJun 15, 2026

Actions

View Source View Plugin View on GitHub View README

Prompt evaluation workflow

Use this workflow when prompts, agent instructions, LLM outputs, or AI app behavior need systematic evaluation.

Steps

Define the prompt surface.
- System, developer, user, tool, retrieval, memory, and output-format instructions.
- Runtime model, temperature or reasoning settings, tool access, and context limits when known.
- Current failure modes or quality goals.
Define success criteria.
- Correctness, completeness, citation quality, format compliance, tone, latency, cost, refusal behavior, or safety.
- Pass/fail checks where possible.
- Human review rubric when automated scoring is not enough.
Build an evaluation set.
- Representative normal cases.
- Edge cases and adversarial cases.
- Regression cases from previous failures.
- Expected outputs, constraints, or scoring notes.
Compare prompt variants.
- Keep one controlled change per variant when feasible.
- Track version, rationale, expected improvement, and observed outcome.
- Measure token usage and latency if available.
Validate tool and retrieval behavior.
- Check whether the prompt asks for unavailable tools or hidden context.
- Verify citations, file references, and retrieved evidence.
- Confirm that fallback behavior is explicit.
Decide deployment readiness.
- Keep the simplest prompt that meets quality goals.
- Document known weaknesses and monitoring signals.
- Add regression examples for failures that should not return.

Output

Return:

Prompt surface reviewed
Evaluation criteria
Test set or representative cases
Variant comparison
Recommended prompt version
Cost, latency, safety, and monitoring notes

Do not

Do not optimize by vibes only.
Do not compare multiple changes without recording what changed.
Do not claim robustness from a tiny or unrepresentative test set.
Do not assume a model, tool, or retrieval source exists without checking runtime evidence.

dev-prompt-evaluation

Popularity

Invocation

Context Preview

SKILL.md

dev-prompt-evaluation

Popularity

Invocation

Context Preview

SKILL.md

Prompt evaluation workflow

Steps

Output

Do not

Similar Skills

Prompt evaluation workflow

Steps

Output

Do not

Similar Skills