From scenario-testing
Framework for authoring, structuring, and managing scenarios — end-to-end user stories validated probabilistically by LLM-as-judge. Covers the holdout principle, scenario anatomy, versioning, composition, and anti-reward-hacking patterns.
Slash command: /scenario-testing:scenario-methodology
This skill provides the conceptual and practical framework for scenario-based validation. A scenario is not a test — it is a structured user story that describes what a user wants to accomplish and what outcomes would satisfy them.
- /scenario-testing:st:scenario for authoring new scenarios
- /scenario-testing:st:review for reviewing and refining scenarios
- /scenario-testing:st:catalog for managing the scenario catalog

| Aspect | Traditional Test | Scenario |
|---|---|---|
| Stored | In the codebase | Outside the codebase (holdout) |
| Written in | Code (assertions) | YAML (natural language criteria) |
| Evaluator | Test runner (boolean) | LLM-as-judge (probabilistic) |
| Measures | Correctness | Satisfaction |
| Deterministic | Yes (same input → same result) | No (same scenario → distribution of trajectories) |
| Reward-hackable | Yes (code can be shaped to pass) | Resistant (holdout + LLM judgment) |
| Who understands it | Developers | Anyone (product, design, QA, developers) |
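
Because the evaluator is probabilistic, a scenario result is naturally reported as a rate over several trajectories rather than a single pass/fail. The sketch below is only an illustration of that idea: `satisfaction_rate` and the `judge` callable are hypothetical names, and the judge here stands in for an LLM call that returns whether one trajectory satisfied the scenario.

```python
from typing import Callable, Sequence


def satisfaction_rate(
    trajectories: Sequence[str],
    judge: Callable[[str], bool],
) -> float:
    """Return the fraction of trajectories the judge finds satisfactory.

    Hypothetical helper: `judge` is a placeholder for an LLM-as-judge call.
    """
    if not trajectories:
        raise ValueError("need at least one trajectory")
    satisfied = sum(1 for trajectory in trajectories if judge(trajectory))
    return satisfied / len(trajectories)


# Stub judge for illustration only; a real judge would call an LLM.
runs = ["ok", "ok", "fail", "ok"]
rate = satisfaction_rate(runs, lambda trajectory: trajectory == "ok")
```

A traditional test would collapse the same four runs into one boolean; the rate keeps the spread of outcomes visible.
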
Scenarios are stored outside the codebase by default (.scenarios/ is gitignored). This mirrors the holdout set concept in machine learning:
The model (code) is developed against training data (tests) but validated against holdout data (scenarios). This prevents overfitting — the code can't be shaped to trivially pass scenarios it doesn't see.
Not all scenarios need to be holdout. Use in-repo storage when:
Configure via .scenarios/config.json. The "storage" key accepts "in-repo" or "holdout" (the default):

```json
{
  "storage": "in-repo"
}
```
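
A minimal sketch of how a tool might resolve the storage mode from this file, assuming the file is strict JSON (no comments) and that a missing file or missing key falls back to the documented default of "holdout". The helper name `resolve_storage_mode` and its error handling are illustrative assumptions, not the plugin's actual implementation.

```python
import json
from pathlib import Path

VALID_MODES = {"holdout", "in-repo"}
DEFAULT_MODE = "holdout"  # the default named in the documentation


def resolve_storage_mode(config_path: Path) -> str:
    """Read the storage mode from a config file, defaulting to holdout.

    Hypothetical helper: the real plugin may resolve this differently.
    """
    if not config_path.exists():
        return DEFAULT_MODE
    config = json.loads(config_path.read_text(encoding="utf-8"))
    mode = config.get("storage", DEFAULT_MODE)
    if mode not in VALID_MODES:
        raise ValueError(f"unknown storage mode: {mode!r}")
    return mode
```

Rejecting unknown values explicitly avoids silently treating a typo such as "in_repo" as holdout storage.
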
Every scenario has 7 required sections and 2 optional sections:
sso-login, export-to-sheets)
auth, onboarding, integrations)

Good criteria are:
Anti-patterns are the inverse of satisfaction criteria — they describe outcomes that are clearly wrong:
A trajectory that matches ANY anti-pattern is automatically judged "unsatisfactory", regardless of satisfaction criteria matches.
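
The veto rule above can be sketched as a small function. This assumes the judge has already produced a verdict from the satisfaction criteria alone plus a list of matched anti-patterns; the name `final_verdict` and the "satisfactory" label are hypothetical, while "unsatisfactory" is the label used in the text.

```python
from typing import Sequence


def final_verdict(
    criteria_verdict: str,
    matched_anti_patterns: Sequence[str],
) -> str:
    """Apply the anti-pattern veto on top of a criteria-based verdict.

    `criteria_verdict` is whatever the judge concluded from the
    satisfaction criteria alone (hypothetical input).
    """
    if matched_anti_patterns:
        # A single anti-pattern match overrides any criteria result.
        return "unsatisfactory"
    return criteria_verdict


# Illustrative anti-pattern text, not taken from a real scenario.
verdict = final_verdict("satisfactory", ["user sees a raw stack trace"])
```
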
```
Draft → Review → Active → Versioned
                   ↑          │
                   └──────────┘  (update + version bump)
```
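
The lifecycle can be modelled as a small state machine. The sketch below assumes the loop in the diagram runs from Versioned back to Active (an update followed by a version bump); the `State` enum, the `TRANSITIONS` table, and `advance` are illustrative names, not part of the plugin.

```python
from enum import Enum


class State(Enum):
    DRAFT = "draft"
    REVIEW = "review"
    ACTIVE = "active"
    VERSIONED = "versioned"


# Allowed transitions as read off the lifecycle diagram (an assumption
# about how the loop should be interpreted).
TRANSITIONS = {
    State.DRAFT: {State.REVIEW},
    State.REVIEW: {State.ACTIVE},
    State.ACTIVE: {State.VERSIONED},
    State.VERSIONED: {State.ACTIVE},  # update + version bump
}


def advance(current: State, target: State) -> State:
    """Move to `target` if the diagram allows it, otherwise raise."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

For example, `advance(State.DRAFT, State.REVIEW)` succeeds, while `advance(State.DRAFT, State.ACTIVE)` raises because a draft has to pass review first.
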
/st:scenario, may be incomplete
/st:review by the scenario-reviewer agent

Scenarios can be composed for complex workflows:
- references/scenario-template.md: YAML template with field documentation
- references/criteria-patterns.md: Patterns for writing effective satisfaction criteria and anti-patterns

npx claudepluginhub dlabs/claude-marketplace --plugin scenario-testing