From autoimprove
Run cross-model calibration for autoimprove skills — compare Opus (gold standard) vs Haiku (cheap) on the same input to identify reasoning gaps. Use when the user says '/calibrate', 'calibrate skill', 'model calibration', or 'calibration gap'. Phase 1: hardcoded for adversarial-review only.
How this skill is triggered — by the user, by Claude, or both
Slash command
/autoimprove:calibrate adversarial-review <file|diff|pr NUMBER>adversarial-review <file|diff|pr NUMBER>This skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
<SKILL-GUARD>
Run cross-model calibration: compare Opus (gold standard) vs Haiku (cheap) on the same adversarial-review input to surface reasoning gaps and actionable prompt improvements.
Initialize progress tracking:
TodoWrite([
{id: "1", content: "🔍 Parse arguments", status: "pending"},
{id: "2", content: "📋 Gather target input", status: "pending"},
{id: "3", content: "🔄 Run Opus & Haiku in parallel", status: "pending"},
{id: "4", content: "📊 Run Sonnet evaluator", status: "pending"},
{id: "5", content: "📋 Display gap report", status: "pending"},
{id: "6", content: "🏷️ Store LCM signal", status: "pending"}
])
Mark step in progress: TodoWrite([{id: "1", status: "in_progress"}])
From the user's input or the skill arguments, extract:
SKILL_NAME: the skill to calibrate (e.g., adversarial-review)INPUT: file path, "diff", or a PR numberPhase 1 gate: If SKILL_NAME is NOT adversarial-review, output this message and stop:
Phase 1 calibration only supports `adversarial-review`. Generic skill wrapping is deferred to Phase 2.
If no SKILL_NAME is provided, default to adversarial-review.
Mark complete: TodoWrite([{id: "1", status: "completed"}])
Mark step in progress: TodoWrite([{id: "2", status: "in_progress"}])
Collect the content to review and store as TARGET_INPUT.
If INPUT is "diff" or empty:
git diff HEAD in the repo directorygit diff --stagedIf INPUT is a file path:
If INPUT is a PR number:
^[0-9]+$ — if not numeric, output: "Invalid PR number: {INPUT}" and stop.gh pr diff {INPUT} to fetch the PR diffStore the gathered content as TARGET_INPUT.
Mark complete: TodoWrite([{id: "2", status: "completed"}])
Mark step in progress: TodoWrite([{id: "3", status: "in_progress"}])
CRITICAL: Do NOT call Skill('autoimprove:adversarial-review') — nested skill invocation does not support model override. Instead, spawn two agents with the AR steps inlined.
Spawn the following two agents in parallel (both at the same time):
model: claude-opus-4-6
prompt: |
Repo-local SAFETY rules apply. Before beginning, read `SAFETY.md` at the project root (it contains the experimenter safety contract for this repo) and treat its rules as non-negotiable. If `SAFETY.md` is not found, STOP and report — do not proceed without it.
You are running an adversarial-style code review. Your job is to find ALL real issues in the following code or diff.
<target>
{TARGET_INPUT}
</target>
For each issue you find, produce a JSON finding. Be thorough and evidence-based — cite the exact code or reasoning behind each finding.
Output ONLY the following JSON structure, nothing else:
{
"findings": [
{
"id": "F1",
"severity": "critical|high|medium|low",
"description": "clear, specific description of the issue",
"evidence": "exact code snippet or reasoning that proves this is an issue",
"file": "path/to/file or null",
"line": 42
}
]
}
If there are no issues, output: { "findings": [] }
model: claude-haiku-4-5-20251001
prompt: |
Repo-local SAFETY rules apply. Before beginning, read `SAFETY.md` at the project root (it contains the experimenter safety contract for this repo) and treat its rules as non-negotiable. If `SAFETY.md` is not found, STOP and report — do not proceed without it.
You are running an adversarial-style code review. Your job is to find ALL real issues in the following code or diff.
<target>
{TARGET_INPUT}
</target>
For each issue you find, produce a JSON finding. Be thorough and evidence-based — cite the exact code or reasoning behind each finding.
Output ONLY the following JSON structure, nothing else:
{
"findings": [
{
"id": "F1",
"severity": "critical|high|medium|low",
"description": "clear, specific description of the issue",
"evidence": "exact code snippet or reasoning that proves this is an issue",
"file": "path/to/file or null",
"line": 42
}
]
}
If there are no issues, output: { "findings": [] }
Collect outputs as OPUS_RESULT and HAIKU_RESULT.
Mark complete: TodoWrite([{id: "3", status: "completed"}])
Mark step in progress: TodoWrite([{id: "4", status: "in_progress"}])
Spawn a Sonnet agent to compare the two outputs:
model: claude-sonnet-4-5
prompt: |
You are a calibration evaluator comparing Opus (gold standard) and Haiku outputs for adversarial code review.
## Opus Output (gold standard)
{OPUS_RESULT}
## Haiku Output (candidate)
{HAIKU_RESULT}
Evaluate the quality gap across four dimensions:
1. What did Opus find that Haiku MISSED? (false negatives — Haiku's blind spots)
2. What did Haiku flag that Opus dismissed? (false positives — noise Haiku adds)
3. Where did Haiku findings lack depth or evidence compared to Opus?
4. What specific, targeted prompt changes would close the gap?
DEDUPLICATION: Before comparing, normalize findings semantically. Two findings describing the same issue with different wording count as ONE match — do not inflate missed_by_haiku with duplicates.
IMPORTANT: The gap_score is diagnostic only. It must NEVER be used as a benchmark metric or to trigger automated grind loops.
Output ONLY this JSON, nothing else:
{
"gap_score": <integer 0-10, where 0=identical quality and 10=completely different>,
"haiku_find_rate": <float 0.0-1.0, fraction of Opus findings that Haiku also found>,
"missed_by_haiku": [
{
"finding_id": "F1",
"severity": "critical|high|medium|low",
"description": "what was missed",
"why_matters": "why this miss is significant"
}
],
"false_positives_haiku": [
{
"description": "what Haiku flagged",
"reason_dismissed_by_opus": "why Opus correctly dismissed it"
}
],
"depth_gaps": [
{
"finding": "brief description of the finding",
"opus_evidence": "what Opus cited",
"haiku_evidence": "what Haiku cited",
"gap": "what depth or specificity is missing from Haiku"
}
],
"prompt_improvements": [
{
"target": "agents/enthusiast.md OR agents/adversary.md OR agents/judge.md",
"improvement": "specific text to add or change — be precise",
"reason": "this addresses the gap because..."
}
],
"summary": "<one sentence: overall quality gap between Opus and Haiku>"
}
Store result as GAP_REPORT.
If Sonnet output is not valid JSON (malformed, wrapped in fences, truncated): re-prompt once with: "Your response was not valid JSON. Return only the corrected JSON object, no markdown fences." If still invalid, store a signal with tags: ['signal:calibration', 'error:evaluator-malformed'] and output: "Calibration aborted — evaluator returned malformed JSON. Raw output logged." then stop.
Mark complete: TodoWrite([{id: "4", status: "completed"}])
Mark step in progress: TodoWrite([{id: "5", status: "in_progress"}])
Parse GAP_REPORT and render the following output:
## Calibration Report — adversarial-review
**Gap Score:** {gap_score}/10 (target: <3)
**Haiku Find Rate:** {haiku_find_rate * 100}% (target: ≥80%)
### Missed by Haiku ({count} findings)
{for each missed finding:}
- [{SEVERITY}] {description} — {why_matters}
### False Positives from Haiku ({count})
{for each false positive:}
- {description} — dismissed because: {reason_dismissed_by_opus}
### Depth Gaps ({count})
{for each depth gap:}
- **{finding}**
- Opus: {opus_evidence}
- Haiku: {haiku_evidence}
- Gap: {gap}
### Prompt Improvements Recommended ({count})
{for each improvement:}
- **Target:** {target}
**Change:** {improvement}
**Reason:** {reason}
### Summary
{summary}
---
*Results are diagnostic only. gap_score is NOT a benchmark metric.*
Mark complete: TodoWrite([{id: "5", status: "completed", content: "📋 Display gap report — gap {gap_score}/10, find rate {haiku_find_rate_pct}%"}])
Mark step in progress: TodoWrite([{id: "6", status: "in_progress"}])
Store the calibration result for trend tracking and the autoimprove learning loop.
If the lcm_store MCP tool is available, call it with:
tags: ['signal:calibration', 'skill:adversarial-review', 'model:opus-vs-haiku', 'calibration:signal-only']
content: |
gap_score: {gap_score}
haiku_find_rate: {haiku_find_rate}
missed_by_haiku_count: {count}
false_positives_count: {count}
prompt_improvements_count: {count}
summary: {summary}
If lcm_store is NOT available, write the result to ~/.autoimprove/calibration/ as a dated JSON file:
mkdir -p ~/.autoimprove/calibration~/.autoimprove/calibration/{YYYY-MM-DD}-ar-calibration.jsonMark complete: TodoWrite([{id: "6", status: "completed"}])
Before leaving the execution flow, close all todos explicitly:
TodoWrite([
{id: "1", status: "completed"},
{id: "2", status: "completed"},
{id: "3", status: "completed"},
{id: "4", status: "completed"},
{id: "5", status: "completed"},
{id: "6", status: "completed"}
])
gap_score field is DIAGNOSTIC ONLY.autoimprove.yaml benchmarks.| Argument | What is calibrated |
|---|---|
adversarial-review diff | Current working-tree diff (default input) |
adversarial-review <file-path> | A specific file |
adversarial-review pr <number> | A GitHub PR diff |
# Calibrate on the current working diff
/autoimprove:calibrate adversarial-review diff
# Calibrate on a specific file
/autoimprove:calibrate adversarial-review src/scripts/evaluate.sh
# Calibrate on a GitHub PR
/autoimprove:calibrate adversarial-review pr 42
Hardcoded for adversarial-review only. Generic skill wrapping (/autoimprove:calibrate idea-matrix, etc.) is deferred to Phase 2.
Phase 2 trigger: gap_score measured at least 3 times manually before automation.
npx claudepluginhub tokyo-megacorp/autoimproveCreates, edits, and optimizes skills for Claude Code, including drafting, evaluating with test prompts, iterating on performance, and improving skill descriptions for better triggering accuracy.