Skill

calibrate

Run cross-model calibration for autoimprove skills — compare Opus (gold standard) vs Haiku (cheap) on the same input to identify reasoning gaps. Use when the user says '/calibrate', 'calibrate skill', 'model calibration', or 'calibration gap'. Phase 1: hardcoded for adversarial-review only.

Popularity

Stars

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/autoimprove:calibrate adversarial-review <file|diff|pr NUMBER>

User invocable

Model invocable

Inline context

Default effort

Argument hintadversarial-review <file|diff|pr NUMBER>

Tool Access

This skill is limited to the following tools:

ReadGlobGrepBashAgent

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

<SKILL-GUARD>

SKILL.md

347 lines · ~2.9k tokens

Stats

LanguageShell

Stars2

MaintenanceExcellent

Last CommitJun 13, 2026

Actions

View Source View Plugin View on GitHub View README

Stats

Actions

Step 1: 🔍 Parse Arguments

Mark step in progress: TodoWrite([{id: "1", status: "in_progress"}])

From the user's input or the skill arguments, extract:

SKILL_NAME: the skill to calibrate (e.g., adversarial-review)
INPUT: file path, "diff", or a PR number

Phase 1 gate: If SKILL_NAME is NOT adversarial-review, output this message and stop:

Phase 1 calibration only supports `adversarial-review`. Generic skill wrapping is deferred to Phase 2.

If no SKILL_NAME is provided, default to adversarial-review.

Mark complete: TodoWrite([{id: "1", status: "completed"}])

Step 2: 📋 Gather Target Input

Mark step in progress: TodoWrite([{id: "2", status: "in_progress"}])

Collect the content to review and store as TARGET_INPUT.

If INPUT is "diff" or empty:

Run git diff HEAD in the repo directory
If that returns nothing, fall back to git diff --staged
If still empty, output: "Nothing to calibrate — working tree and staging area are clean." and stop.

If INPUT is a file path:

Read the file content
If file does not exist, output: "File not found: {INPUT}" and stop.

If INPUT is a PR number:

Validate INPUT matches ^[0-9]+$ — if not numeric, output: "Invalid PR number: {INPUT}" and stop.
Run gh pr diff {INPUT} to fetch the PR diff

Store the gathered content as TARGET_INPUT.

Mark complete: TodoWrite([{id: "2", status: "completed"}])

Step 3: 🔄 Run AR with Opus and Haiku in PARALLEL

Mark step in progress: TodoWrite([{id: "3", status: "in_progress"}])

CRITICAL: Do NOT call Skill('autoimprove:adversarial-review') — nested skill invocation does not support model override. Instead, spawn two agents with the AR steps inlined.

Spawn the following two agents in parallel (both at the same time):

Agent 1 — Opus (gold standard)

model: claude-opus-4-6
prompt: |
  Repo-local SAFETY rules apply. Before beginning, read `SAFETY.md` at the project root (it contains the experimenter safety contract for this repo) and treat its rules as non-negotiable. If `SAFETY.md` is not found, STOP and report — do not proceed without it.

  You are running an adversarial-style code review. Your job is to find ALL real issues in the following code or diff.

  <target>
  {TARGET_INPUT}
  </target>

  For each issue you find, produce a JSON finding. Be thorough and evidence-based — cite the exact code or reasoning behind each finding.

  Output ONLY the following JSON structure, nothing else:

  {
    "findings": [
      {
        "id": "F1",
        "severity": "critical|high|medium|low",
        "description": "clear, specific description of the issue",
        "evidence": "exact code snippet or reasoning that proves this is an issue",
        "file": "path/to/file or null",
        "line": 42
      }
    ]
  }

  If there are no issues, output: { "findings": [] }

Agent 2 — Haiku (cheap candidate)

model: claude-haiku-4-5-20251001
prompt: |
  Repo-local SAFETY rules apply. Before beginning, read `SAFETY.md` at the project root (it contains the experimenter safety contract for this repo) and treat its rules as non-negotiable. If `SAFETY.md` is not found, STOP and report — do not proceed without it.

  You are running an adversarial-style code review. Your job is to find ALL real issues in the following code or diff.

  <target>
  {TARGET_INPUT}
  </target>

  For each issue you find, produce a JSON finding. Be thorough and evidence-based — cite the exact code or reasoning behind each finding.

  Output ONLY the following JSON structure, nothing else:

  {
    "findings": [
      {
        "id": "F1",
        "severity": "critical|high|medium|low",
        "description": "clear, specific description of the issue",
        "evidence": "exact code snippet or reasoning that proves this is an issue",
        "file": "path/to/file or null",
        "line": 42
      }
    ]
  }

  If there are no issues, output: { "findings": [] }

Collect outputs as OPUS_RESULT and HAIKU_RESULT.

Mark complete: TodoWrite([{id: "3", status: "completed"}])

Step 4: 📊 Run Sonnet Evaluator

Mark step in progress: TodoWrite([{id: "4", status: "in_progress"}])

Spawn a Sonnet agent to compare the two outputs:

model: claude-sonnet-4-5
prompt: |
  You are a calibration evaluator comparing Opus (gold standard) and Haiku outputs for adversarial code review.

  ## Opus Output (gold standard)
  {OPUS_RESULT}

  ## Haiku Output (candidate)
  {HAIKU_RESULT}

  Evaluate the quality gap across four dimensions:
  1. What did Opus find that Haiku MISSED? (false negatives — Haiku's blind spots)
  2. What did Haiku flag that Opus dismissed? (false positives — noise Haiku adds)
  3. Where did Haiku findings lack depth or evidence compared to Opus?
  4. What specific, targeted prompt changes would close the gap?

  DEDUPLICATION: Before comparing, normalize findings semantically. Two findings describing the same issue with different wording count as ONE match — do not inflate missed_by_haiku with duplicates.

  IMPORTANT: The gap_score is diagnostic only. It must NEVER be used as a benchmark metric or to trigger automated grind loops.

  Output ONLY this JSON, nothing else:
  {
    "gap_score": <integer 0-10, where 0=identical quality and 10=completely different>,
    "haiku_find_rate": <float 0.0-1.0, fraction of Opus findings that Haiku also found>,
    "missed_by_haiku": [
      {
        "finding_id": "F1",
        "severity": "critical|high|medium|low",
        "description": "what was missed",
        "why_matters": "why this miss is significant"
      }
    ],
    "false_positives_haiku": [
      {
        "description": "what Haiku flagged",
        "reason_dismissed_by_opus": "why Opus correctly dismissed it"
      }
    ],
    "depth_gaps": [
      {
        "finding": "brief description of the finding",
        "opus_evidence": "what Opus cited",
        "haiku_evidence": "what Haiku cited",
        "gap": "what depth or specificity is missing from Haiku"
      }
    ],
    "prompt_improvements": [
      {
        "target": "agents/enthusiast.md OR agents/adversary.md OR agents/judge.md",
        "improvement": "specific text to add or change — be precise",
        "reason": "this addresses the gap because..."
      }
    ],
    "summary": "<one sentence: overall quality gap between Opus and Haiku>"
  }

Store result as GAP_REPORT.

If Sonnet output is not valid JSON (malformed, wrapped in fences, truncated): re-prompt once with: "Your response was not valid JSON. Return only the corrected JSON object, no markdown fences." If still invalid, store a signal with tags: ['signal:calibration', 'error:evaluator-malformed'] and output: "Calibration aborted — evaluator returned malformed JSON. Raw output logged." then stop.

Mark complete: TodoWrite([{id: "4", status: "completed"}])

Step 5: 📋 Display Gap Report

Mark step in progress: TodoWrite([{id: "5", status: "in_progress"}])

Parse GAP_REPORT and render the following output:

## Calibration Report — adversarial-review

**Gap Score:** {gap_score}/10  (target: <3)
**Haiku Find Rate:** {haiku_find_rate * 100}%  (target: ≥80%)

### Missed by Haiku ({count} findings)
{for each missed finding:}
  - [{SEVERITY}] {description} — {why_matters}

### False Positives from Haiku ({count})
{for each false positive:}
  - {description} — dismissed because: {reason_dismissed_by_opus}

### Depth Gaps ({count})
{for each depth gap:}
  - **{finding}**
    - Opus: {opus_evidence}
    - Haiku: {haiku_evidence}
    - Gap: {gap}

### Prompt Improvements Recommended ({count})
{for each improvement:}
  - **Target:** {target}
    **Change:** {improvement}
    **Reason:** {reason}

### Summary
{summary}

---
*Results are diagnostic only. gap_score is NOT a benchmark metric.*

Mark complete: TodoWrite([{id: "5", status: "completed", content: "📋 Display gap report — gap {gap_score}/10, find rate {haiku_find_rate_pct}%"}])

Step 6: 🏷️ Store LCM Signal

Mark step in progress: TodoWrite([{id: "6", status: "in_progress"}])

Store the calibration result for trend tracking and the autoimprove learning loop.

If the lcm_store MCP tool is available, call it with:

tags: ['signal:calibration', 'skill:adversarial-review', 'model:opus-vs-haiku', 'calibration:signal-only']
content: |
  gap_score: {gap_score}
  haiku_find_rate: {haiku_find_rate}
  missed_by_haiku_count: {count}
  false_positives_count: {count}
  prompt_improvements_count: {count}
  summary: {summary}

If lcm_store is NOT available, write the result to ~/.autoimprove/calibration/ as a dated JSON file:

Create the directory if it does not exist: mkdir -p ~/.autoimprove/calibration
Write file: ~/.autoimprove/calibration/{YYYY-MM-DD}-ar-calibration.json
Content: full GAP_REPORT JSON

Mark complete: TodoWrite([{id: "6", status: "completed"}])

Final Step - Cleanup

Before leaving the execution flow, close all todos explicitly:

TodoWrite([
  {id: "1", status: "completed"},
  {id: "2", status: "completed"},
  {id: "3", status: "completed"},
  {id: "4", status: "completed"},
  {id: "5", status: "completed"},
  {id: "6", status: "completed"}
])

CRITICAL Goodhart Boundary

The gap_score field is DIAGNOSTIC ONLY.
It MUST NEVER be added to autoimprove.yaml benchmarks.
It MUST NEVER be used as a grind loop metric or to influence theme selection weights.
The experimenter sees: "your prompt needs improvement at X because of Y structural reason."
The experimenter does NOT see numeric gap trends in the benchmark pipeline.

Arguments

Argument	What is calibrated
`adversarial-review diff`	Current working-tree diff (default input)
`adversarial-review <file-path>`	A specific file
`adversarial-review pr <number>`	A GitHub PR diff

Usage Examples

# Calibrate on the current working diff
/autoimprove:calibrate adversarial-review diff

# Calibrate on a specific file
/autoimprove:calibrate adversarial-review src/scripts/evaluate.sh

# Calibrate on a GitHub PR
/autoimprove:calibrate adversarial-review pr 42

Phase 1 Scope

Hardcoded for adversarial-review only. Generic skill wrapping (/autoimprove:calibrate idea-matrix, etc.) is deferred to Phase 2.

Phase 2 trigger: gap_score measured at least 3 times manually before automation.

calibrate

Popularity

Invocation

Tool Access

Context Preview

SKILL.md

calibrate

Popularity

Invocation

Tool Access

Context Preview

SKILL.md

Step 1: 🔍 Parse Arguments

Step 2: 📋 Gather Target Input

Step 3: 🔄 Run AR with Opus and Haiku in PARALLEL

Agent 1 — Opus (gold standard)

Agent 2 — Haiku (cheap candidate)

Step 4: 📊 Run Sonnet Evaluator

Step 5: 📋 Display Gap Report

Step 6: 🏷️ Store LCM Signal

Final Step - Cleanup

CRITICAL Goodhart Boundary

Arguments

Usage Examples

Phase 1 Scope

Similar Skills

Step 1: 🔍 Parse Arguments

Step 2: 📋 Gather Target Input

Step 3: 🔄 Run AR with Opus and Haiku in PARALLEL

Agent 1 — Opus (gold standard)

Agent 2 — Haiku (cheap candidate)

Step 4: 📊 Run Sonnet Evaluator

Step 5: 📋 Display Gap Report

Step 6: 🏷️ Store LCM Signal

Final Step - Cleanup

CRITICAL Goodhart Boundary

Arguments

Usage Examples

Phase 1 Scope

Similar Skills