Skill

backtest

This skill should be used when the user asks to "backtest review skills", "test detection rate", "バックテスト", "レビュースキルをテスト", "検出率を測定", "過去のバグで検証", or wants to verify that generated review skills can detect known bugs by replaying historical states.

Slash command: /tailored-reviewer:backtest

Last commit: Mar 24, 2026

Backtest: Review Skill Detection Testing

Test generated review skills against historical bugs by replaying the codebase state at the time each bug was introduced. Measures both recall (did we catch known bugs?) and precision (were our findings validated by subsequent fixes?).

Prerequisites:

Generated skills exist in .claude/skills/
workspace/ contains the project clone

Test Case Management

Adding Test Cases

When invoked with --add-case or when the user wants to add a test case:

Prompt for:

Source: Where was this bug found? (PR #, JIRA ticket, Sentry event, postmortem, Slack thread)
Commit: The commit that introduced the bug (or the fix commit to reverse-engineer from)
Description: What was the bug?
Expected detection: Which perspective(s) should catch this?

Append to backtest/test-cases.md:

### Case [N]: [brief title]
- **Source**: [PR/JIRA/Sentry/postmortem reference]
- **Bug commit**: [hash]
- **Fix commit**: [hash] (if available)
- **Description**: [what the bug was]
- **Expected perspective**: [which perspective should detect it]
- **Added**: [date]
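
As a rough illustration of the template above, the following Python sketch renders one entry and appends it to the test-case file. The `BacktestCase` class, the function names, and the choice to omit the fix-commit line when no fix commit is known are assumptions made for this example; they are not part of the skill itself.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class BacktestCase:
    """Hypothetical container for the fields the template asks for."""
    number: int
    title: str
    source: str
    bug_commit: str
    description: str
    expected_perspective: str
    added: date
    fix_commit: Optional[str] = None  # the template marks this "(if available)"


def render_case(case: BacktestCase) -> str:
    """Render one entry using the template's field order."""
    lines = [
        f"### Case {case.number}: {case.title}",
        f"- **Source**: {case.source}",
        f"- **Bug commit**: {case.bug_commit}",
    ]
    if case.fix_commit:
        # Assumption: skip the line entirely when no fix commit is known.
        lines.append(f"- **Fix commit**: {case.fix_commit}")
    lines += [
        f"- **Description**: {case.description}",
        f"- **Expected perspective**: {case.expected_perspective}",
        f"- **Added**: {case.added.isoformat()}",
    ]
    return "\n".join(lines) + "\n"


def append_case(path: str, case: BacktestCase) -> None:
    """Append the rendered entry to the file, preceded by a blank line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write("\n" + render_case(case))
```

Calling `append_case("backtest/test-cases.md", case)` would add the entry at the end of the file; the case values themselves come from the prompts listed above.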

Bulk Import

When importing from interview data:

Scan bug-patterns.md for bugs that have both a bug-introducing commit and a fix commit identified
Auto-generate test cases for each

Running Backtest

Execution Flow

For each test case in backtest/test-cases.md:

1. Checkout: cd workspace && git checkout {bug_commit} (the state WITH the bug)
2. Generate diff: git diff {bug_commit}~1 {bug_commit} (the buggy change)
3. Execute review: Run the review entry point (review-* skill in .claude/skills/) against this diff, with the --backtest context flag. The consolidation step saves the review to reviews/ as usual; backtest does NOT change the review output location.
4. Evaluate detection (recall): Did any finding match the known bug?
   - Match criteria: same file, related description, severity >= Important
   - Partial match: right area but wrong specific issue
   - Miss: no finding related to the known bug
5. Evaluate precision (forward validation): For findings that DON'T match the known bug:
   - Check if the cited code was modified in subsequent commits: git log {bug_commit}..{default_branch} -- {file_path}
   - If modified: read the fix commit diff to determine if the finding's concern was addressed
   - Validated: finding pointed to real code that was later changed to address the same concern
   - Unvalidated: finding pointed to code that was never subsequently changed (may be a false positive, or an unfixed issue)
6. Restore: cd workspace && git checkout {default_branch}
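
The recall criteria in the flow above can be sketched as a small classifier. This is only an illustration: the `Finding` fields, the severity scale other than "Important", and the decision to count a related but low-severity finding as a partial match are all assumptions, and whether a description is "related" remains a reviewer judgment that the code takes as input.

```python
from dataclasses import dataclass
from enum import Enum


class Detection(Enum):
    DETECTED = "detected"
    PARTIAL = "partial"
    MISSED = "missed"


# Assumed severity scale; only "Important" is named in the criteria.
SEVERITY_RANK = {"Minor": 0, "Important": 1, "Critical": 2}


@dataclass
class Finding:
    file: str
    severity: str
    related_to_bug: bool      # judgment: description relates to the known bug
    same_area: bool = False   # judgment: right area, wrong specific issue


def classify(findings: list[Finding], bug_file: str) -> Detection:
    """Return the best outcome any finding achieves for one known bug."""
    best = Detection.MISSED
    for f in findings:
        severe_enough = (
            SEVERITY_RANK.get(f.severity, -1) >= SEVERITY_RANK["Important"]
        )
        if f.file == bug_file and f.related_to_bug and severe_enough:
            return Detection.DETECTED
        if f.related_to_bug or f.same_area:
            # Assumption: related but below the bar counts as partial.
            best = Detection.PARTIAL
    return best
```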

IMPORTANT: In step 1, check out {bug_commit} (NOT {bug_commit}~1). The workspace must contain the buggy code so that the orchestrator's Phase 1.5 fact-check (workspace verification) can confirm the bug exists. If the workspace were at {bug_commit}~1, the buggy code would not exist in the workspace and all findings would be falsely dropped.
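
To make the commit arithmetic concrete, here is a minimal Python sketch of the git invocations for one test case. The git commands are the ones quoted in the flow; the function names, the `review` callable standing in for the review entry point, and the `try`/`finally` restore are assumptions for illustration.

```python
import subprocess
from typing import Callable


def backtest_commands(
    bug_commit: str, default_branch: str, file_path: str
) -> dict[str, list[str]]:
    """Build the git commands for one case, as argument lists."""
    return {
        # Check out the bug commit itself, not its parent, so the
        # workspace contains the buggy code.
        "checkout": ["git", "checkout", bug_commit],
        # The buggy change is the diff from the parent to the bug commit.
        "diff": ["git", "diff", f"{bug_commit}~1", bug_commit],
        # Later commits touching a cited file (forward validation).
        "later_changes": [
            "git", "log", f"{bug_commit}..{default_branch}", "--", file_path
        ],
        "restore": ["git", "checkout", default_branch],
    }


def run_git(args: list[str], cwd: str = "workspace") -> str:
    """Run one git command inside the workspace clone and return stdout."""
    result = subprocess.run(
        args, cwd=cwd, check=True, capture_output=True, text=True
    )
    return result.stdout


def replay_case(
    bug_commit: str, default_branch: str, review: Callable[[str], object]
) -> object:
    """Check out the buggy state, review its diff, then always restore.

    `review` is a hypothetical stand-in for the review entry point.
    """
    cmds = backtest_commands(bug_commit, default_branch, file_path=".")
    run_git(cmds["checkout"])
    try:
        return review(run_git(cmds["diff"]))
    finally:
        run_git(cmds["restore"])
```

Restoring in a `finally` block is a design choice of this sketch: it leaves the workspace on the default branch even if the review step raises.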

Results

The review itself is saved to reviews/ by the orchestrator (same as any normal review). The backtest evaluation (recall/precision analysis) is written separately to backtest/results/YYYY-MM-DD-{target}.md:

# Backtest Results: [date]

## Summary
- Test cases: N
- Detected (recall): N/N (X%)
- Partial: N (X%)
- Missed: N (X%)
- Additional findings: N
  - Validated by subsequent fixes: N (X%)
  - Unvalidated: N

## Per-Case Results

### Case [N]: [title]
- **Known bug result**: detected / partial / missed
- **Detecting perspective**: [which perspective found it, if any]
- **Finding**: [the relevant finding, if any]
- **Notes**: [why it was missed, if applicable]
- **Additional findings**: N
  - Validated: [list findings that were later fixed, with fix commit hash]
  - Unvalidated: [list findings with no subsequent fix]

## Analysis

### Recall (known bug detection)
[Perspectives or bug types with low detection rate]

### Precision (forward validation)
[Rate of findings validated by subsequent fixes]
[High validation rate = review is finding real issues]
[Low validation rate = review may be producing noise]

### Recommendations
[Specific suggestions for skill improvement based on misses and validation rates]
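
The Summary block of the report can be derived mechanically from the per-case outcomes. The sketch below shows one way to compute the counts and percentages; the function names and the choice to round to whole percentages are assumptions, not something the template prescribes.

```python
from collections import Counter


def pct(part: int, whole: int) -> int:
    """Whole-number percentage; 0 when there is nothing to divide by."""
    return round(100 * part / whole) if whole else 0


def summarize(case_results: list[str], validated: int, unvalidated: int) -> str:
    """Render the Summary section.

    case_results holds one of "detected", "partial", or "missed" per case;
    validated and unvalidated count the additional findings.
    """
    counts = Counter(case_results)
    total = len(case_results)
    additional = validated + unvalidated
    return "\n".join([
        "## Summary",
        f"- Test cases: {total}",
        f"- Detected (recall): {counts['detected']}/{total} "
        f"({pct(counts['detected'], total)}%)",
        f"- Partial: {counts['partial']} ({pct(counts['partial'], total)}%)",
        f"- Missed: {counts['missed']} ({pct(counts['missed'], total)}%)",
        f"- Additional findings: {additional}",
        f"  - Validated by subsequent fixes: {validated} "
        f"({pct(validated, additional)}%)",
        f"  - Unvalidated: {unvalidated}",
    ])
```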

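Learning Extraction (runs automatically after backtest)

Analyze the MISS and Partial results from the backtest and append them, in structured form, to backtest/learnings.md. build-skills and update-skills read this file and reflect it in skill generation.

Extraction Process

For each MISS or Partial match:

1. Root cause analysis: Why was the bug not detected?
   - Which perspective should have been responsible?
   - What was missing from the existing check items?
   - What kind of check would have detected it?
2. Pattern extraction: Convert the miss into a reusable detection rule
   - Abstract the specific bug into a general check pattern
   - Example: "missing locked issue guard" → "defensive checks are asymmetric between parallel functions that process the same dataset"
3. Append: Add an entry to backtest/learnings.md in the following format

### Learning [N]: [pattern name]
- **Source**: backtest [date], Case [N] (MISS/Partial)
- **Bug**: [what happened]
- **Root cause**: [why it was not detected]
- **Check to add**: [what specifically should be checked]
- **Target perspective**: [which perspective it should be added to]
- **Pattern type**: code-symmetry / state-transition / boundary-check / ...
- **Added**: [date]

Do not append a learning that duplicates an existing one.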
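
The append-with-deduplication step might look like the following Python sketch. The source does not define what counts as a duplicate, so comparing pattern names case-insensitively is purely an assumption here, as are the function names; a real run would more likely rely on a judgment about whether two learnings describe the same check.

```python
import re
from typing import Optional

# Matches entry headings such as "### Learning 3: some pattern name".
HEADING = re.compile(r"^### Learning (\d+): (.+)$", re.MULTILINE)


def existing_learnings(text: str) -> list[tuple[int, str]]:
    """Return (number, pattern name) for every entry already in the file."""
    return [(int(n), name.strip()) for n, name in HEADING.findall(text)]


def next_entry(
    text: str, pattern_name: str, fields: dict[str, str]
) -> Optional[str]:
    """Build the next entry, or return None if the pattern already exists.

    Assumption: two learnings are duplicates when their pattern names
    match case-insensitively.
    """
    known = existing_learnings(text)
    if any(name.casefold() == pattern_name.casefold() for _, name in known):
        return None
    number = max((n for n, _ in known), default=0) + 1
    lines = [f"### Learning {number}: {pattern_name}"]
    lines += [f"- **{key}**: {value}" for key, value in fields.items()]
    return "\n".join(lines) + "\n"
```

The caller would read backtest/learnings.md, pass its contents as `text`, and append the returned string only when it is not `None`.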

Interpretation

Recall (known bug detection)

Detection rate > 70%: good for deployment
Detection rate 40-70%: usable but needs skill refinement
Detection rate < 40%: skills need significant rework, feed results to update-skills

Precision (forward validation)

Validation rate > 50%: excellent — review is finding real issues beyond the known bug
Validation rate 20-50%: good — some noise but meaningful signal
Validation rate < 20%: review may be producing too much noise

A review system with high recall AND high precision is genuinely useful — it catches known bugs and also surfaces issues that developers independently recognized and fixed.
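
The thresholds above translate directly into two lookup functions. This sketch takes rates as percentages and resolves the boundary values as the ranges are written (70% falls in the 40-70 band, 50% in the 20-50 band); the function names are illustrative.

```python
def interpret_recall(rate: float) -> str:
    """Map a detection rate (in percent) to the recall guidance."""
    if rate > 70:
        return "good for deployment"
    if rate >= 40:
        return "usable but needs skill refinement"
    return "skills need significant rework, feed results to update-skills"


def interpret_precision(rate: float) -> str:
    """Map a validation rate (in percent) to the precision guidance."""
    if rate > 50:
        return "excellent"
    if rate >= 20:
        return "good"
    return "review may be producing too much noise"
```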

Compare with previous backtest results to track improvement over time.
