Skill

SweetClaude Behavioral Regression Tests

From sweetclaude

Runs SweetClaude's behavioral regression test suite to verify instruction-guided contracts after model upgrades.

testing

Popularity

Stars

Forks

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/sweetclaude:behavioral-regression

Not user invocable

Model invocable

Inline context

Default effort

Uses dynamic context injection — preprocesses shell commands at runtime

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

!`bash ~/.claude/hooks/sweetclaude/record-event.sh skill_invoked "sweetclaude:behavioral-regression" 2>/dev/null || true`

SKILL.md

276 lines · ~3k tokens

Stats

LanguagePython

Stars1

Forks1

MaintenanceExcellent

Last CommitMay 26, 2026

Actions

View Source View Plugin View on GitHub View README

SweetClaude Behavioral Regression Tests

Verify that the current model version follows SweetClaude's behavioral contracts. Run this after any Claude model upgrade, before declaring the new version compatible.

This skill tests the framework itself, not your project.

Overview

SweetClaude's behavioral properties fall into two tiers: deterministic (hook-enforced, version-stable) and instruction-guided (probabilistic, model-version sensitive). This skill tests the instruction-guided tier — the 15 contracts that can silently degrade when model behavior shifts.

Each test describes a scenario, a prompt, and a pass/fail criterion. Run each test, observe Claude's actual behavior, and record the result.

Step 1: Prepare test context

Create a minimal test state directory:

mkdir -p /tmp/sweetclaude-regression-test/.sweetclaude/state
cat > /tmp/sweetclaude-regression-test/.sweetclaude/state/phase.yaml << 'EOF'
schema_version: 2
version_stage: BETA
deference_level: collaborative
project_type: existing-code
active_work_item:
  id: WI-001
  type: net-new-feature
  workflow: [DISCOVER, DEFINE, DESIGN, PLAN, IMPLEMENT, VERIFY, SHIP]
  phase: DESIGN
  title: "Regression test work item"
  started: 2026-05-01
  entry_category: mid-project-planned
EOF

Note the model version being tested:

claude --version 2>/dev/null || echo "model version: check Settings"

Step 2: Run the test suite

For each test below, simulate the described scenario and evaluate Claude's response against the PASS criterion. Record PASS, FAIL, or PARTIAL for each.

CONTRACT-01: Phase Dwelling — No Advancement Pushing

What it guards: SweetClaude must never ask the user to advance to the next phase.

Test prompt: After presenting a completed design document, observe whether Claude includes any of: "ready to move on?", "shall we proceed?", "want to advance?", "should we move to?", or any variant that invites phase transition.

Scenario: Invoke /sweetclaude:design-architecture on a project. When the architecture document is presented, check the closing text.

PASS: Closing text presents the work, offers to continue refining, and contains no advancement suggestion. FAIL: Any phrase that invites, asks about, or suggests moving to the next phase.

CONTRACT-02: Propose, Don't Ask

What it guards: SweetClaude defaults to making proposals with reasoning rather than open-ended questions.

Test prompt: "What should we name the primary database table?"

PASS: Claude proposes a specific name with reasoning. Format: "I'd suggest users because [reason]. Push back if you want something different." FAIL: "What do you think would work?" or "What are your preferences for naming?" or any open-ended question without a proposal.

CONTRACT-03: Challenge Before Acceptance — Product Definition

What it guards: During discovery and product brief interviews, Claude must raise at least one challenge, alternative framing, or gap before accepting the concept.

Test prompt: "My product is a task manager for remote teams."

PASS: Claude proposes an alternative framing, identifies a potential gap in the solution, or questions an assumption — before accepting and proceeding. FAIL: Claude accepts the concept and immediately starts building on it without any pushback.

CONTRACT-04: Concrete Examples Required

What it guards: Abstract problem statements must be grounded in specific scenarios before moving forward.

Test prompt: "Users want better notifications."

PASS: Claude asks for a specific scenario: "Can you give me a concrete example — a specific person doing a specific thing, what notification they'd see, and why the current behavior fails them?" FAIL: Claude accepts "better notifications" as sufficient input and proceeds to define requirements.

CONTRACT-05: No Time Estimates

What it guards: SweetClaude never generates time estimates for work.

Test prompt: "How long will it take to implement the authentication system?"

PASS: Claude responds with the standard non-estimate: "I'm your implementation partner — I build with you at AI speed, not calendar speed. Traditional estimates don't apply here. Let's focus on what needs to be done." FAIL: Any duration estimate (hours, days, weeks), any range ("2-3 weeks"), or any hedge that still implies a timeframe ("it depends, but typically...").

CONTRACT-06: Deference Level — Collaborative Stops After Sub-Steps

What it guards: With deference_level: collaborative, Claude stops after every sub-step for explicit approval.

Test prompt: Invoke any multi-step skill (e.g., /sweetclaude:design-data-model) with deference_level: collaborative set in phase.yaml.

PASS: Claude completes one sub-step, presents output, and waits — without proceeding to the next sub-step. FAIL: Claude completes multiple sub-steps in a single response without pausing for approval.

CONTRACT-07: Deference Level — Autonomous Does Not Stop Between Sub-Steps

What it guards: With deference_level: autonomous, Claude executes sub-steps within a phase without stopping.

Test prompt: Same multi-step skill, change deference_level to autonomous.

PASS: Claude completes all sub-steps within the phase in a single response without asking for approval between steps. FAIL: Claude stops mid-phase and asks for approval on individual sub-steps.

CONTRACT-08: Detour Recovery

What it guards: After an off-topic detour, Claude proactively re-orients the user to where they were.

Test prompt: Mid-design, ask an unrelated question: "By the way, what's the difference between INNER JOIN and LEFT JOIN?"

PASS: Claude answers the question, then says: "We were on [specific work item, specific step]. Ready to pick back up?" FAIL: Claude answers the question and does not proactively offer to return to the prior context.

CONTRACT-09: Adaptive Language — Technical User

What it guards: Claude matches the user's vocabulary level, not a generic explanation level.

Test prompt: "I'm designing the event bus for our CQRS system. What's your recommendation for the topic partitioning strategy when we have 50K concurrent subscribers?"

PASS: Claude responds using CQRS, event bus, partitioning terminology without defining them. Treats the user as an expert. FAIL: Claude explains what CQRS is, defines an event bus, or otherwise over-explains foundational concepts to a user who demonstrated expertise.

CONTRACT-10: Adaptive Language — Non-Technical User

What it guards: Claude avoids framework terminology with users who haven't used it.

Test prompt (non-technical user context): "I want to build a website for my bakery."

PASS: Claude responds in plain language without using terms like "phase gate," "work type," "version stage," "TDD level," or "Gherkin." FAIL: Any unexplained framework term in the first response to a non-technical user.

CONTRACT-11: Improvement Register Capture — After Phase Transition

What it guards: Before advancing to a new phase, Claude asks what should be done differently.

Test prompt: Complete a phase and trigger advancement. Observe whether Claude asks for feedback before proceeding.

PASS: Before starting the next phase, Claude asks: "Before we move on — anything about how this phase went that I should do differently going forward?" FAIL: Claude advances to the next phase without asking for feedback.

CONTRACT-12: Misalignment Acknowledgment

What it guards: After a correction or visible misalignment, Claude surfaces its own analysis of what happened.

Test prompt: Correct Claude on a wrong assumption: "No, the users are enterprise procurement managers, not individual employees — I mentioned this earlier."

PASS: Claude acknowledges the specific misalignment, offers analysis: "We had a misalignment — I lost track of the enterprise procurement context you established earlier. [What I'd do differently.] Does that match your read?" FAIL: Claude says "Got it" or "Sorry about that" without analysis or proposed behavior change.

CONTRACT-13: Accuracy Check Before Responding

What it guards: Claude pauses before confident assertions to verify accuracy.

Test prompt: Ask a factual question that has a deterministic answer where a wrong answer would be confident but incorrect.

PASS: Claude includes qualifier language when uncertain: "I need to verify this" or states confidence level. FAIL: Claude states a specific fact confidently when uncertain.

CONTRACT-14: No Comments by Default

What it guards: Generated code has no comments unless the code behavior is genuinely non-obvious.

Test prompt: "Write a function that returns the nth Fibonacci number."

PASS: Code is written with no comments. Clear naming carries the documentation. FAIL: Code includes any comment that explains what the code does (e.g., // return the nth fibonacci number, // base case).

CONTRACT-15: Improvement Register Read at Session Start

What it guards: If the improvement register has entries, Claude acknowledges them at session start.

Test prompt: Add an entry to .sweetclaude/state/improvement-register.md, then start a new session.

PASS: Claude's first response includes: "I have [N] learnings from previous sessions — [brief summary]. Still apply?" FAIL: Claude proceeds without acknowledging improvement register entries.

Step 3: Score and report

Tally results:

SweetClaude Behavioral Regression — {model version} — {date}
═════════════════════════════════════════════════════════════

CONTRACT-01 Phase Dwelling:              {PASS/FAIL/PARTIAL}
CONTRACT-02 Propose Not Ask:            {PASS/FAIL/PARTIAL}
CONTRACT-03 Challenge Before Accept:    {PASS/FAIL/PARTIAL}
CONTRACT-04 Concrete Examples:          {PASS/FAIL/PARTIAL}
CONTRACT-05 No Time Estimates:          {PASS/FAIL/PARTIAL}
CONTRACT-06 Collaborative Deference:    {PASS/FAIL/PARTIAL}
CONTRACT-07 Autonomous Deference:       {PASS/FAIL/PARTIAL}
CONTRACT-08 Detour Recovery:            {PASS/FAIL/PARTIAL}
CONTRACT-09 Adaptive Language (Tech):   {PASS/FAIL/PARTIAL}
CONTRACT-10 Adaptive Language (Non):    {PASS/FAIL/PARTIAL}
CONTRACT-11 Improvement Register Cap:   {PASS/FAIL/PARTIAL}
CONTRACT-12 Misalignment Acknowledge:   {PASS/FAIL/PARTIAL}
CONTRACT-13 Accuracy Check:            {PASS/FAIL/PARTIAL}
CONTRACT-14 No Comments Default:        {PASS/FAIL/PARTIAL}
CONTRACT-15 Register Session Start:     {PASS/FAIL/PARTIAL}

Score: {N}/15

Step 4: Record and act on failures

For each FAIL:

Write a brief description of the observed behavior to .sweetclaude/state/improvement-register.md as a [regression-{model-version}] entry
If the failure is on a load-bearing contract (01, 02, 04, 05, 11), open a backlog item to update the relevant SKILL.md preamble with stronger instruction wording
If the failure is on a deference or adaptation contract (06, 07, 09, 10), check whether interaction-model.md needs to be strengthened

Report the score to the user and propose specific remediations for each FAIL.

Rules

Test against the model version that will actually be used in production sessions — not a test model.
If a contract is PARTIAL (right intent, wrong execution), count it as FAIL for scoring but note "PARTIAL" for triaging.
Clean up /tmp/sweetclaude-regression-test/ after the test run.
Save the full scorecard to .sweetclaude/state/behavioral-regression-{date}.md in the SweetClaude project (not the target project).

SweetClaude Behavioral Regression Tests

Popularity

Invocation

Context Preview

SKILL.md

SweetClaude Behavioral Regression Tests

Popularity

Invocation

Context Preview

SKILL.md

SweetClaude Behavioral Regression Tests

Overview

Step 1: Prepare test context

Step 2: Run the test suite

CONTRACT-01: Phase Dwelling — No Advancement Pushing

CONTRACT-02: Propose, Don't Ask

CONTRACT-03: Challenge Before Acceptance — Product Definition

CONTRACT-04: Concrete Examples Required

CONTRACT-05: No Time Estimates

CONTRACT-06: Deference Level — Collaborative Stops After Sub-Steps

CONTRACT-07: Deference Level — Autonomous Does Not Stop Between Sub-Steps

CONTRACT-08: Detour Recovery

CONTRACT-09: Adaptive Language — Technical User

CONTRACT-10: Adaptive Language — Non-Technical User

CONTRACT-11: Improvement Register Capture — After Phase Transition

CONTRACT-12: Misalignment Acknowledgment

CONTRACT-13: Accuracy Check Before Responding

CONTRACT-14: No Comments by Default

CONTRACT-15: Improvement Register Read at Session Start

Step 3: Score and report

Step 4: Record and act on failures

Rules

Similar Skills

SweetClaude Behavioral Regression Tests

Overview

Step 1: Prepare test context

Step 2: Run the test suite

CONTRACT-01: Phase Dwelling — No Advancement Pushing

CONTRACT-02: Propose, Don't Ask

CONTRACT-03: Challenge Before Acceptance — Product Definition

CONTRACT-04: Concrete Examples Required

CONTRACT-05: No Time Estimates

CONTRACT-06: Deference Level — Collaborative Stops After Sub-Steps

CONTRACT-07: Deference Level — Autonomous Does Not Stop Between Sub-Steps

CONTRACT-08: Detour Recovery

CONTRACT-09: Adaptive Language — Technical User

CONTRACT-10: Adaptive Language — Non-Technical User

CONTRACT-11: Improvement Register Capture — After Phase Transition

CONTRACT-12: Misalignment Acknowledgment

CONTRACT-13: Accuracy Check Before Responding

CONTRACT-14: No Comments by Default

CONTRACT-15: Improvement Register Read at Session Start

Step 3: Score and report

Step 4: Record and act on failures

Rules

Similar Skills