Skill

satisfaction-metrics

Framework for measuring, aggregating, and trending satisfaction scores across scenarios. Covers LLM-as-judge methodology, trajectory evaluation, threshold configuration, comparison analysis, and reporting.

Popularity

Parent stars

Parent forks

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/scenario-testing:satisfaction-metrics

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

This skill provides the measurement framework for scenario validation. Satisfaction is a probabilistic metric — the fraction of observed trajectories through scenarios that an LLM judge deems "satisfactory" for the described user.

Supporting Files

references/judge-prompt-template.mdreferences/report-schemas.md

SKILL.md

270 lines · ~2k tokens

Stats

LanguageShell

Parent stars1

Parent forks1

MaintenanceGood

Last CommitFeb 21, 2026

Actions

View Source View Plugin View on GitHub View README

Stats

Actions

Satisfaction Metrics

When to Use

/scenario-testing:st:satisfy — judging trajectories
/scenario-testing:st:report — generating satisfaction reports
/scenario-testing:st:validate — full pipeline validation
When any agent needs to understand how satisfaction is computed

Why Satisfaction, Not Correctness

Traditional tests measure correctness: does the output match the expected output?

Satisfaction measures something different: would the user described in the scenario consider this outcome acceptable?

This distinction matters because:

Agentic software has many valid execution paths — there is no single "correct" output
LLM-powered features produce non-deterministic outputs — running the same scenario twice yields different (but potentially equally good) results
User satisfaction is context-dependent — the persona's expertise, goals, and expectations shape what counts as "good enough"

The Satisfaction Computation

Per-Trajectory

Each trajectory is judged independently:

judge(trajectory, scenario) → {
  verdict: "satisfactory" | "unsatisfactory",
  reasoning: string,
  criteria_matched: string[],
  anti_patterns_matched: string[],
  confidence: float  // 0.0 to 1.0
}

A trajectory is "satisfactory" when ALL of:

At least ONE satisfaction criterion is matched
ZERO anti-patterns are matched
The judge believes the described persona would accept this outcome

Per-Scenario

satisfaction(scenario) = count(satisfactory) / count(total_trajectories)

The number of trajectories per scenario is configurable (default: 50). More trajectories give a tighter confidence interval on the satisfaction estimate.

Per-Domain

satisfaction(domain) = mean(satisfaction(scenario) for scenario in domain)

Equal weighting across scenarios in the domain. Override with explicit weights in config if some scenarios are more important.

Overall

satisfaction(overall) = weighted_mean(satisfaction(domain) for domain in all_domains)
weight(domain) = total_trajectory_count(domain)

Weighted by trajectory count to avoid small domains disproportionately affecting the overall score.

LLM-as-Judge Methodology

Default Judge Configuration

model: opus
temperature: 0.0  # deterministic judgment
max_tokens: 2000

system_prompt: |
  You are a satisfaction judge evaluating whether a software interaction trajectory
  would satisfy the user described in the scenario.

  You will receive:
  1. The scenario (persona, intent, satisfaction criteria, anti-patterns)
  2. The trajectory (actions taken, state transitions, final outcome)

  Evaluate:
  - Does the outcome match ANY of the satisfaction criteria?
  - Does the trajectory match ANY anti-patterns?
  - Would the described persona consider this outcome acceptable?

  Respond with a JSON object:
  {
    "verdict": "satisfactory" or "unsatisfactory",
    "reasoning": "2-3 sentence explanation",
    "criteria_matched": ["list of matched criteria"],
    "anti_patterns_matched": ["list of matched anti-patterns"],
    "confidence": 0.0 to 1.0
  }

Why Temperature 0?

The judge should be deterministic — the same trajectory judged twice should get the same verdict. Temperature 0 ensures consistent evaluation. If you want to measure judge variance, run multiple judgment passes and compare.

Custom Judges

Override the default judge per-scenario or per-domain:

# Per-scenario (in the scenario YAML)
judge_config:
  system_prompt: |
    You are a security-focused judge. Weight data protection
    and access control heavily in your evaluation.
  strict_mode: true  # any anti-pattern = unsatisfactory regardless

// Per-domain (in config.json)
{
  "judge_overrides": {
    "auth": {
      "system_prompt": "...",
      "strict_mode": true
    }
  }
}

Thresholds

Configuration

{
  "thresholds": {
    "global_minimum": 0.80,
    "domains": {
      "auth": 0.95,
      "payments": 0.98,
      "onboarding": 0.75,
      "integrations": 0.85
    },
    "scenarios": {
      "sso-login": 0.98,
      "password-reset": 0.90
    }
  }
}

Threshold Hierarchy

Scenario-specific > domain-specific > global minimum. If a scenario has its own threshold, that takes precedence.

Threshold Semantics

Above threshold — the feature is working well enough for users
Below threshold — investigation needed, satisfaction has degraded
Below global minimum — critical alert, feature may be broken

Comparison Analysis

Run-over-Run Comparison

/scenario-testing:st:report --compare-to 2026-02-15

Computes delta per-scenario and highlights:

Improvements (positive delta > 0.02)
Regressions (negative delta > 0.02)
Stable (delta within ±0.02)

Trend Analysis

/scenario-testing:st:report --trend --days 30

Shows satisfaction over time, helping identify:

Gradual degradation (satisfaction slowly declining)
Step-function changes (satisfaction dropped after a specific date/deploy)
Recovery patterns (satisfaction dropped and came back)

Statistical Considerations

Sample Size

The number of trajectories affects confidence in the satisfaction estimate:

Trajectories	95% Confidence Interval Width
10	±0.31
25	±0.20
50	±0.14
100	±0.10
200	±0.07
500	±0.04

For critical scenarios (auth, payments), use 100+ trajectories. For exploratory features, 25-50 is sufficient.

Judge Reliability

The LLM judge is itself non-deterministic (even at temperature 0, model updates can change behavior). Periodically:

Re-judge a sample of past trajectories and compare verdicts
Compute inter-judge agreement if using multiple judge configurations
Review the judge's reasoning for edge cases

Report Format

Terminal Report

═══════════════════════════════════════════════
  SATISFACTION REPORT — 2026-02-20
═══════════════════════════════════════════════

  Overall: 0.91 (1365/1500 trajectories)

  Domain          Satisfaction  Threshold  Status
  ─────────────   ────────────  ─────────  ──────
  auth            0.96          0.95       PASS
  onboarding      0.87          0.75       PASS
  integrations    0.88          0.85       PASS
  payments        0.93          0.98       FAIL ▼

  Scenarios Below Threshold:
    payments/checkout-flow  0.93 (target: 0.98)
      → 7/100 unsatisfactory: timeout during payment confirmation
        not communicated to user (no retry UI)

═══════════════════════════════════════════════

JSON Report (for CI)

{
  "date": "2026-02-20",
  "overall_satisfaction": 0.91,
  "total_trajectories": 1500,
  "satisfactory_trajectories": 1365,
  "pass": false,
  "domains": {
    "auth": { "satisfaction": 0.96, "threshold": 0.95, "pass": true },
    "payments": { "satisfaction": 0.93, "threshold": 0.98, "pass": false }
  },
  "failing_scenarios": [
    {
      "id": "checkout-flow",
      "domain": "payments",
      "satisfaction": 0.93,
      "threshold": 0.98,
      "failure_summary": "timeout handling not communicated to user"
    }
  ]
}

References

references/judge-prompt-template.md — Default and domain-specific judge system prompts
references/report-schemas.md — JSON schemas for reports, judgments, and run manifests

satisfaction-metrics

Popularity

Invocation

Context Preview

Supporting Files

SKILL.md

satisfaction-metrics

Popularity

Invocation

Context Preview

Supporting Files

SKILL.md

Satisfaction Metrics

When to Use

Why Satisfaction, Not Correctness

The Satisfaction Computation

Per-Trajectory

Per-Scenario

Per-Domain

Overall

LLM-as-Judge Methodology

Default Judge Configuration

Why Temperature 0?

Custom Judges

Thresholds

Configuration

Threshold Hierarchy

Threshold Semantics

Comparison Analysis

Run-over-Run Comparison

Trend Analysis

Statistical Considerations

Sample Size

Judge Reliability

Report Format

Terminal Report

JSON Report (for CI)

References

Similar Skills

Satisfaction Metrics

When to Use

Why Satisfaction, Not Correctness

The Satisfaction Computation

Per-Trajectory

Per-Scenario

Per-Domain

Overall

LLM-as-Judge Methodology

Default Judge Configuration

Why Temperature 0?

Custom Judges

Thresholds

Configuration

Threshold Hierarchy

Threshold Semantics

Comparison Analysis

Run-over-Run Comparison

Trend Analysis

Statistical Considerations

Sample Size

Judge Reliability

Report Format

Terminal Report

JSON Report (for CI)

References

Similar Skills