From codex-next
Designs, tests, compares, versions, and validates prompts or LLM behavior using measurable criteria and datasets. Useful when evaluating prompt quality, edge cases, and deployment readiness.
Slash command
/codex-next:dev-prompt-evaluation
Use this workflow when prompts, agent instructions, LLM outputs, or AI app behavior need systematic evaluation.
Define the prompt surface.
Define success criteria.
Build an evaluation set.
Compare prompt variants.
Validate tool and retrieval behavior.
Decide deployment readiness.
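The steps above can be sketched as a small harness. This is a minimal illustration under stated assumptions, not the skill's actual implementation: `fake_model`, `EvalCase`, the two prompt variants, and the 0.9 readiness threshold are all hypothetical names and values chosen so the example runs offline. In practice `fake_model` would be replaced by a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class EvalCase:
    """One evaluation input paired with a measurable pass/fail check."""
    name: str
    user_input: str
    check: Callable[[str], bool]


def fake_model(prompt: str, user_input: str) -> str:
    """Hypothetical stand-in for an LLM call; deterministic so the sketch runs offline."""
    text = user_input.strip()
    if not text:
        return "I need more information."
    if "JSON" in prompt:
        return '{"answer": "' + text.upper() + '"}'
    return text.upper()


def pass_rate(prompt: str, cases: List[EvalCase],
              model: Callable[[str, str], str]) -> float:
    """Fraction of cases whose output satisfies its check."""
    passed = sum(1 for case in cases if case.check(model(prompt, case.user_input)))
    return passed / len(cases)


def compare(variants: Dict[str, str], cases: List[EvalCase],
            model: Callable[[str, str], str]) -> Dict[str, float]:
    """Score every prompt variant against the same evaluation set."""
    return {name: pass_rate(prompt, cases, model) for name, prompt in variants.items()}


# Evaluation set: one typical case plus two edge cases (all illustrative).
CASES = [
    EvalCase("typical", "hello", lambda out: out.startswith("{") and "HELLO" in out),
    EvalCase("empty input", "   ", lambda out: "more information" in out),
    EvalCase("keeps digits", "abc123", lambda out: "ABC123" in out),
]

# Prompt variants under comparison (hypothetical wording).
VARIANTS = {
    "v1-plain": "Answer the user.",
    "v2-json": "Answer the user. Respond in JSON.",
}

READY_THRESHOLD = 0.9  # assumed bar for deployment readiness

scores = compare(VARIANTS, CASES, fake_model)
best = max(scores, key=scores.get)
ready = scores[best] >= READY_THRESHOLD

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
print(f"best={best} ready={ready}")
```

Keeping the evaluation set fixed while only the prompt changes is what makes the scores comparable across variants; the readiness decision then reduces to checking the best score against a threshold agreed on beforehand.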
Return:
npx claudepluginhub blueskyxn/codex-is-all-you-need --plugin codex-next
Designs test cases, adversarial inputs, and iterates on prompts based on eval results. Useful for prompt-engineering tasks like drafting, testing, and refining prompts and skills.
Guides building evals before prompts for LLM features, agents, or prompts. Helps measure improvement objectively and avoid speculative iteration.