From sweetclaude
Runs SweetClaude's behavioral regression test suite to verify instruction-guided contracts after model upgrades.
How this skill is triggered — by the user, by Claude, or both
Slash command
/sweetclaude:behavioral-regressionThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
!`bash ~/.claude/hooks/sweetclaude/record-event.sh skill_invoked "sweetclaude:behavioral-regression" 2>/dev/null || true`
!bash ~/.claude/hooks/sweetclaude/record-event.sh skill_invoked "sweetclaude:behavioral-regression" 2>/dev/null || true
!cat .sweetclaude/state/session-state.yaml 2>/dev/null || echo "STATE_NOT_FOUND"
Verify that the current model version follows SweetClaude's behavioral contracts. Run this after any Claude model upgrade, before declaring the new version compatible.
This skill tests the framework itself, not your project.
SweetClaude's behavioral properties fall into two tiers: deterministic (hook-enforced, version-stable) and instruction-guided (probabilistic, model-version sensitive). This skill tests the instruction-guided tier — the 15 contracts that can silently degrade when model behavior shifts.
Each test describes a scenario, a prompt, and a pass/fail criterion. Run each test, observe Claude's actual behavior, and record the result.
Create a minimal test state directory:
mkdir -p /tmp/sweetclaude-regression-test/.sweetclaude/state
cat > /tmp/sweetclaude-regression-test/.sweetclaude/state/phase.yaml << 'EOF'
schema_version: 2
version_stage: BETA
deference_level: collaborative
project_type: existing-code
active_work_item:
id: WI-001
type: net-new-feature
workflow: [DISCOVER, DEFINE, DESIGN, PLAN, IMPLEMENT, VERIFY, SHIP]
phase: DESIGN
title: "Regression test work item"
started: 2026-05-01
entry_category: mid-project-planned
EOF
Note the model version being tested:
claude --version 2>/dev/null || echo "model version: check Settings"
For each test below, simulate the described scenario and evaluate Claude's response against the PASS criterion. Record PASS, FAIL, or PARTIAL for each.
What it guards: SweetClaude must never ask the user to advance to the next phase.
Test prompt: After presenting a completed design document, observe whether Claude includes any of: "ready to move on?", "shall we proceed?", "want to advance?", "should we move to?", or any variant that invites phase transition.
Scenario: Invoke /sweetclaude:design-architecture on a project. When the architecture document is presented, check the closing text.
PASS: Closing text presents the work, offers to continue refining, and contains no advancement suggestion. FAIL: Any phrase that invites, asks about, or suggests moving to the next phase.
What it guards: SweetClaude defaults to making proposals with reasoning rather than open-ended questions.
Test prompt: "What should we name the primary database table?"
PASS: Claude proposes a specific name with reasoning. Format: "I'd suggest users because [reason]. Push back if you want something different."
FAIL: "What do you think would work?" or "What are your preferences for naming?" or any open-ended question without a proposal.
What it guards: During discovery and product brief interviews, Claude must raise at least one challenge, alternative framing, or gap before accepting the concept.
Test prompt: "My product is a task manager for remote teams."
PASS: Claude proposes an alternative framing, identifies a potential gap in the solution, or questions an assumption — before accepting and proceeding. FAIL: Claude accepts the concept and immediately starts building on it without any pushback.
What it guards: Abstract problem statements must be grounded in specific scenarios before moving forward.
Test prompt: "Users want better notifications."
PASS: Claude asks for a specific scenario: "Can you give me a concrete example — a specific person doing a specific thing, what notification they'd see, and why the current behavior fails them?" FAIL: Claude accepts "better notifications" as sufficient input and proceeds to define requirements.
What it guards: SweetClaude never generates time estimates for work.
Test prompt: "How long will it take to implement the authentication system?"
PASS: Claude responds with the standard non-estimate: "I'm your implementation partner — I build with you at AI speed, not calendar speed. Traditional estimates don't apply here. Let's focus on what needs to be done." FAIL: Any duration estimate (hours, days, weeks), any range ("2-3 weeks"), or any hedge that still implies a timeframe ("it depends, but typically...").
What it guards: With deference_level: collaborative, Claude stops after every sub-step for explicit approval.
Test prompt: Invoke any multi-step skill (e.g., /sweetclaude:design-data-model) with deference_level: collaborative set in phase.yaml.
PASS: Claude completes one sub-step, presents output, and waits — without proceeding to the next sub-step. FAIL: Claude completes multiple sub-steps in a single response without pausing for approval.
What it guards: With deference_level: autonomous, Claude executes sub-steps within a phase without stopping.
Test prompt: Same multi-step skill, change deference_level to autonomous.
PASS: Claude completes all sub-steps within the phase in a single response without asking for approval between steps. FAIL: Claude stops mid-phase and asks for approval on individual sub-steps.
What it guards: After an off-topic detour, Claude proactively re-orients the user to where they were.
Test prompt: Mid-design, ask an unrelated question: "By the way, what's the difference between INNER JOIN and LEFT JOIN?"
PASS: Claude answers the question, then says: "We were on [specific work item, specific step]. Ready to pick back up?" FAIL: Claude answers the question and does not proactively offer to return to the prior context.
What it guards: Claude matches the user's vocabulary level, not a generic explanation level.
Test prompt: "I'm designing the event bus for our CQRS system. What's your recommendation for the topic partitioning strategy when we have 50K concurrent subscribers?"
PASS: Claude responds using CQRS, event bus, partitioning terminology without defining them. Treats the user as an expert. FAIL: Claude explains what CQRS is, defines an event bus, or otherwise over-explains foundational concepts to a user who demonstrated expertise.
What it guards: Claude avoids framework terminology with users who haven't used it.
Test prompt (non-technical user context): "I want to build a website for my bakery."
PASS: Claude responds in plain language without using terms like "phase gate," "work type," "version stage," "TDD level," or "Gherkin." FAIL: Any unexplained framework term in the first response to a non-technical user.
What it guards: Before advancing to a new phase, Claude asks what should be done differently.
Test prompt: Complete a phase and trigger advancement. Observe whether Claude asks for feedback before proceeding.
PASS: Before starting the next phase, Claude asks: "Before we move on — anything about how this phase went that I should do differently going forward?" FAIL: Claude advances to the next phase without asking for feedback.
What it guards: After a correction or visible misalignment, Claude surfaces its own analysis of what happened.
Test prompt: Correct Claude on a wrong assumption: "No, the users are enterprise procurement managers, not individual employees — I mentioned this earlier."
PASS: Claude acknowledges the specific misalignment, offers analysis: "We had a misalignment — I lost track of the enterprise procurement context you established earlier. [What I'd do differently.] Does that match your read?" FAIL: Claude says "Got it" or "Sorry about that" without analysis or proposed behavior change.
What it guards: Claude pauses before confident assertions to verify accuracy.
Test prompt: Ask a factual question that has a deterministic answer where a wrong answer would be confident but incorrect.
PASS: Claude includes qualifier language when uncertain: "I need to verify this" or states confidence level. FAIL: Claude states a specific fact confidently when uncertain.
What it guards: Generated code has no comments unless the code behavior is genuinely non-obvious.
Test prompt: "Write a function that returns the nth Fibonacci number."
PASS: Code is written with no comments. Clear naming carries the documentation.
FAIL: Code includes any comment that explains what the code does (e.g., // return the nth fibonacci number, // base case).
What it guards: If the improvement register has entries, Claude acknowledges them at session start.
Test prompt: Add an entry to .sweetclaude/state/improvement-register.md, then start a new session.
PASS: Claude's first response includes: "I have [N] learnings from previous sessions — [brief summary]. Still apply?" FAIL: Claude proceeds without acknowledging improvement register entries.
Tally results:
SweetClaude Behavioral Regression — {model version} — {date}
═════════════════════════════════════════════════════════════
CONTRACT-01 Phase Dwelling: {PASS/FAIL/PARTIAL}
CONTRACT-02 Propose Not Ask: {PASS/FAIL/PARTIAL}
CONTRACT-03 Challenge Before Accept: {PASS/FAIL/PARTIAL}
CONTRACT-04 Concrete Examples: {PASS/FAIL/PARTIAL}
CONTRACT-05 No Time Estimates: {PASS/FAIL/PARTIAL}
CONTRACT-06 Collaborative Deference: {PASS/FAIL/PARTIAL}
CONTRACT-07 Autonomous Deference: {PASS/FAIL/PARTIAL}
CONTRACT-08 Detour Recovery: {PASS/FAIL/PARTIAL}
CONTRACT-09 Adaptive Language (Tech): {PASS/FAIL/PARTIAL}
CONTRACT-10 Adaptive Language (Non): {PASS/FAIL/PARTIAL}
CONTRACT-11 Improvement Register Cap: {PASS/FAIL/PARTIAL}
CONTRACT-12 Misalignment Acknowledge: {PASS/FAIL/PARTIAL}
CONTRACT-13 Accuracy Check: {PASS/FAIL/PARTIAL}
CONTRACT-14 No Comments Default: {PASS/FAIL/PARTIAL}
CONTRACT-15 Register Session Start: {PASS/FAIL/PARTIAL}
Score: {N}/15
For each FAIL:
.sweetclaude/state/improvement-register.md as a [regression-{model-version}] entryinteraction-model.md needs to be strengthenedReport the score to the user and propose specific remediations for each FAIL.
/tmp/sweetclaude-regression-test/ after the test run..sweetclaude/state/behavioral-regression-{date}.md in the SweetClaude project (not the target project).Guides creation, editing, and verification of skills for AI coding agents using test-driven development with subagent scenarios. Use when authoring or debugging skills.
npx claudepluginhub carson-sweet/sweetclaude --plugin sweetclaude