scenario-methodology

Framework for authoring, structuring, and managing scenarios — end-to-end user stories validated probabilistically by LLM-as-judge. Covers the holdout principle, scenario anatomy, versioning, composition, and anti-reward-hacking patterns.

This skill provides the conceptual and practical framework for scenario-based validation. A scenario is not a test — it is a structured user story that describes what a user wants to accomplish and what outcomes would satisfy them.

Last commit: Feb 21, 2026

Scenario Methodology

When to Use

/scenario-testing:st:scenario — authoring new scenarios
/scenario-testing:st:review — reviewing and refining scenarios
/scenario-testing:st:catalog — managing the scenario catalog
When any agent needs to understand what a scenario is and how to write one

Key Distinction: Scenario vs. Test

| Aspect | Traditional Test | Scenario |
| --- | --- | --- |
| Stored | In the codebase | Outside the codebase (holdout) |
| Written in | Code (assertions) | YAML (natural language criteria) |
| Evaluator | Test runner (boolean) | LLM-as-judge (probabilistic) |
| Measures | Correctness | Satisfaction |
| Deterministic | Yes (same input → same result) | No (same scenario → distribution of trajectories) |
| Reward-hackable | Yes (code can be shaped to pass) | Resistant (holdout + LLM judgment) |
| Who understands it | Developers | Anyone (product, design, QA, developers) |
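Because the evaluator is probabilistic, a single run says little about a scenario. One way to read the "distribution of trajectories" row is to run the scenario several times and report the share of runs the judge accepted. The sketch below assumes each run yields a boolean verdict; `satisfaction_rate` and the sample verdicts are hypothetical and not part of the skill.

```python
from typing import Sequence


def satisfaction_rate(verdicts: Sequence[bool]) -> float:
    """Return the fraction of judged trajectories marked satisfactory."""
    if not verdicts:
        raise ValueError("need at least one judged trajectory")
    return sum(verdicts) / len(verdicts)


# Hypothetical judge verdicts from five runs of the same scenario.
runs = [True, True, False, True, True]
print(satisfaction_rate(runs))  # 0.8
```

A traditional test would collapse this to a single pass/fail; here the number of runs and the rate you consider acceptable are choices left to the team.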

The Holdout Principle

By default, scenarios are kept out of the version-controlled codebase: the .scenarios/ directory is gitignored. This mirrors the holdout set concept in machine learning:

Training data = your codebase, including unit and integration tests
Holdout data = your scenarios, stored separately
Evaluation = running scenarios against the code and measuring satisfaction

The model (code) is developed against training data (tests) but validated against holdout data (scenarios). This prevents overfitting — the code can't be shaped to trivially pass scenarios it doesn't see.

When to Use In-Repo Scenarios

Not all scenarios need to be holdout. Use in-repo storage when:

The team needs to collaboratively edit scenarios
Scenarios are tied to specific features and should version with the code
You trust the development process not to game the scenarios

Configure via .scenarios/config.json. The "storage" key accepts "in-repo" or "holdout" (the default):

{
  "storage": "in-repo"
}
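A tool that reads this setting could resolve it roughly as follows. This is a sketch under assumptions: `parse_storage_mode` is a hypothetical helper, the fallback to "holdout" when the key is absent follows the default stated above, and rejecting other values is a choice made here. Python's `json` module parses strict JSON only, so the config text is assumed to contain no comments.

```python
import json

DEFAULT_STORAGE = "holdout"
VALID_STORAGE = {"holdout", "in-repo"}


def parse_storage_mode(config_text: str) -> str:
    """Return the storage mode from config JSON text, defaulting to holdout."""
    config = json.loads(config_text)
    mode = config.get("storage", DEFAULT_STORAGE)
    if mode not in VALID_STORAGE:
        # Assumption: values other than the two documented ones are errors.
        raise ValueError(f"unknown storage mode: {mode!r}")
    return mode


print(parse_storage_mode('{"storage": "in-repo"}'))  # in-repo
print(parse_storage_mode("{}"))                      # holdout
```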

Scenario Anatomy

Every scenario has 7 required sections and 2 optional sections:

Required

id — unique identifier, kebab-case (e.g., sso-login, export-to-sheets)
domain — category grouping (e.g., auth, onboarding, integrations)
version — integer version number, bumped on changes
persona — who is the user (role, expertise, goals)
context — starting state (data, permissions, environment, services)
intent — what the user wants to accomplish (1-2 sentences)
satisfaction_criteria — list of outcomes that would satisfy the user

Optional

anti_patterns — outcomes that would definitely NOT satisfy the user
chaos — failure conditions to inject during execution
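The field list above lends itself to a simple structural check before a scenario enters the catalog. The sketch below assumes a scenario has already been loaded from YAML into a plain dict; `validate_scenario`, the kebab-case regular expression, and the non-empty-criteria rule are illustrative assumptions, and the example values are invented. The authoritative field documentation is in references/scenario-template.md.

```python
import re
from typing import Any

REQUIRED_FIELDS = (
    "id", "domain", "version", "persona",
    "context", "intent", "satisfaction_criteria",
)
# Assumed reading of "kebab-case": lowercase alphanumerics joined by hyphens.
KEBAB_CASE = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*")


def validate_scenario(scenario: dict[str, Any]) -> list[str]:
    """Return a list of structural problems; an empty list means none found."""
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if name not in scenario
    ]
    scenario_id = scenario.get("id")
    if "id" in scenario and not (
        isinstance(scenario_id, str) and KEBAB_CASE.fullmatch(scenario_id)
    ):
        problems.append("id must be kebab-case")
    version = scenario.get("version")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if "version" in scenario and (
        isinstance(version, bool) or not isinstance(version, int)
    ):
        problems.append("version must be an integer")
    criteria = scenario.get("satisfaction_criteria")
    if "satisfaction_criteria" in scenario and (
        not isinstance(criteria, list) or not criteria
    ):
        problems.append("satisfaction_criteria must be a non-empty list")
    return problems


# Invented example values, for illustration only.
example = {
    "id": "sso-login",
    "domain": "auth",
    "version": 1,
    "persona": "IT admin setting up single sign-on for the first time",
    "context": "Fresh workspace with an identity provider already configured",
    "intent": "Log in through the company identity provider.",
    "satisfaction_criteria": ["User lands on the dashboard after login"],
}
print(validate_scenario(example))  # []
```

The optional `anti_patterns` and `chaos` fields are deliberately not checked here, since a scenario is valid without them.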

Writing Satisfaction Criteria

Good criteria are:

Specific enough to judge — "Ticket has a descriptive title" is judgeable; "Ticket is good" is not
Flexible enough to allow valid variation — "Priority is set appropriately (High or Medium for regression bugs)" allows judgment; "Priority is exactly High" is too rigid
User-centered — describe what the user would notice, not what the code does internally
Independent — each criterion can be evaluated separately

Writing Anti-Patterns

Anti-patterns are the inverse of satisfaction criteria — they describe outcomes that are clearly wrong:

"Raw error message shown to user"
"Agent enters infinite retry loop"
"Data is written to wrong account"
"User is asked more than 3 clarifying questions before any action"

A trajectory that matches any anti-pattern is automatically judged "unsatisfactory", no matter how many satisfaction criteria it meets.
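The veto rule can be sketched as a small function that checks anti-patterns before looking at criteria. Everything here is an assumption made for illustration: the `Judgment` shape, the `verdict` helper, and in particular the `threshold` parameter, since the text does not specify how met criteria are combined once no anti-pattern matches.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    """Hypothetical per-trajectory output of the LLM judge."""
    criteria_met: dict[str, bool]     # criterion text -> met or not
    anti_patterns_matched: list[str]  # anti-patterns the judge saw


def verdict(judgment: Judgment, threshold: float = 1.0) -> str:
    """Apply the anti-pattern veto, then an assumed criteria threshold."""
    if judgment.anti_patterns_matched:
        # Any single match overrides the criteria entirely.
        return "unsatisfactory"
    if not judgment.criteria_met:
        raise ValueError("a judgment needs at least one criterion")
    share = sum(judgment.criteria_met.values()) / len(judgment.criteria_met)
    return "satisfactory" if share >= threshold else "unsatisfactory"


all_met_but_vetoed = Judgment(
    criteria_met={"Ticket has a descriptive title": True},
    anti_patterns_matched=["Raw error message shown to user"],
)
print(verdict(all_met_but_vetoed))  # unsatisfactory
```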

Scenario Lifecycle

Draft → Review → Active → Versioned
                   ↑          │
                   └──────────┘ (update + version bump)

Draft — authored via /st:scenario, may be incomplete
Review — reviewed via /st:review by the scenario-reviewer agent
Active — in the catalog, used for validation runs
Versioned — updated with changelog, satisfaction history preserved per version
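The lifecycle reads naturally as a small state machine. The sketch below encodes the four stages and the loop from Versioned back to Active. The names `Stage`, `ALLOWED`, and `advance` are hypothetical, and applying the version bump on entering Versioned is one interpretation; the diagram only labels the loop as "update + version bump".

```python
from enum import Enum


class Stage(Enum):
    DRAFT = "draft"
    REVIEW = "review"
    ACTIVE = "active"
    VERSIONED = "versioned"


# Transitions taken from the diagram, including the Versioned -> Active loop.
ALLOWED = {
    Stage.DRAFT: {Stage.REVIEW},
    Stage.REVIEW: {Stage.ACTIVE},
    Stage.ACTIVE: {Stage.VERSIONED},
    Stage.VERSIONED: {Stage.ACTIVE},
}


def advance(stage: Stage, version: int, target: Stage) -> tuple[Stage, int]:
    """Move to the target stage, bumping the version when entering Versioned."""
    if target not in ALLOWED[stage]:
        raise ValueError(f"cannot move from {stage.value} to {target.value}")
    if target is Stage.VERSIONED:
        version += 1  # assumption: the bump happens at this step
    return target, version


stage, version = Stage.ACTIVE, 1
stage, version = advance(stage, version, Stage.VERSIONED)
stage, version = advance(stage, version, Stage.ACTIVE)
print(stage.value, version)  # active 2
```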

Composition

Scenarios can be composed for complex workflows:

Sequential — scenario A's end state is scenario B's start state
Parallel — scenarios A and B run independently, overall satisfaction is the aggregate
Conditional — scenario B only runs if scenario A's satisfaction meets a threshold
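The three composition modes can be sketched as functions over scenario runners. This is an illustration under several assumptions: a runner is modeled as a hypothetical callable that takes a starting state and returns an end state plus a satisfaction score in [0, 1], the parallel aggregate is taken to be the mean, and in the conditional case scenario B is assumed to start from scenario A's end state. None of these details are fixed by the text.

```python
from typing import Callable

State = dict
# Hypothetical runner: start state -> (end state, satisfaction in [0, 1]).
Runner = Callable[[State], tuple[State, float]]


def sequential(a: Runner, b: Runner, start: State) -> tuple[State, list[float]]:
    """Run A, then run B from A's end state."""
    mid, score_a = a(start)
    end, score_b = b(mid)
    return end, [score_a, score_b]


def parallel(a: Runner, b: Runner, start: State) -> float:
    """Run A and B on separate copies of the start state; return the mean."""
    _, score_a = a(dict(start))
    _, score_b = b(dict(start))
    return (score_a + score_b) / 2


def conditional(a: Runner, b: Runner, start: State, threshold: float) -> list[float]:
    """Run B only if A's satisfaction meets the threshold."""
    mid, score_a = a(start)
    if score_a < threshold:
        return [score_a]
    _, score_b = b(mid)
    return [score_a, score_b]


# Toy runners standing in for real scenario executions.
def login(state: State) -> tuple[State, float]:
    return {**state, "logged_in": True}, 1.0


def export(state: State) -> tuple[State, float]:
    return {**state, "exported": True}, 0.5


print(parallel(login, export, {}))  # 0.75
```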

References

references/scenario-template.md — YAML template with field documentation
references/criteria-patterns.md — Patterns for writing effective satisfaction criteria and anti-patterns
