From great_cto
Filters false positives from security audits, QA regressions, and high-stakes judgments using a 3-round self-challenge and arbiter pattern.
How this skill is triggered — by the user, by Claude, or both
Slash command
/great_cto:skeptical-triage
When to use
Apply skeptical triage when:
- A finding could block a gate (gate:code, gate:ship, gate:qa, gate:arch) and flipping it wrongly wastes CTO time
- A verdict is about to be written to a report that downstream agents will trust (CSO, QA, ADR)
- Multiple signals disagree (one reviewer says VALID, another says INVALID); the arbiter resolves cleanly

Do NOT apply to:
- P2 findings or advisory notes (cost > benefit)
- Hard findings (secrets in source/git, confirmed CVEs, failing tests): these are facts, not judgments
- Quick factual lookups ("does this file exist?", "what version is pinned?")
docs/** src/** lib/** app/**
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
Filter false positives from multi-angle review, security audit, QA regression flags, or any high-stakes judgment before it turns into a blocker.
Three rounds of skeptical self-review + an impartial arbiter, with a confidence score from the vote.
| Caller | Finding type | Apply triage? |
|---|---|---|
| /review | Angle 2/4/7/9 P0/P1 (security, SQL, privacy, concurrency) | Yes |
| /review --deep | Any angle P0/P1 | Yes |
| security-officer | CSO audit P0/P1 | Yes |
| security-officer | Secret in source/git, confirmed CVE | No (hard finding) |
| qa-engineer | Flaky-test verdict (is this a regression or flake?) | Yes |
| architect | ADR trade-off dispute (option A vs. B when both look reasonable) | Yes |
| Any | P2/advisory | No |
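The routing table above can be read as a small predicate. The following is a minimal sketch, not part of the skill itself: the `Finding` class, its field names, and `should_triage` are hypothetical, and it assumes flaky-test verdicts and ADR disputes arrive tagged P0 or P1.

```python
from dataclasses import dataclass
from typing import Optional

# Angles that /review triages without --deep (security, SQL, privacy, concurrency).
REVIEW_ANGLES = {2, 4, 7, 9}
TRIAGED_SEVERITIES = {"P0", "P1"}


@dataclass(frozen=True)
class Finding:
    """Hypothetical container for one finding; field names are assumptions."""
    caller: str                  # "review", "security-officer", "qa-engineer", "architect"
    severity: str                # "P0", "P1", "P2", or "advisory"
    hard: bool = False           # True for facts: secret in source/git, confirmed CVE
    angle: Optional[int] = None  # review angle number, only meaningful for /review
    deep: bool = False           # True when the caller was /review --deep


def should_triage(finding: Finding) -> bool:
    """Return True when the table says the finding goes through triage."""
    if finding.hard:
        return False  # hard findings are facts, not judgments
    if finding.severity not in TRIAGED_SEVERITIES:
        return False  # P2 and advisory notes are skipped
    if finding.caller == "review" and not finding.deep:
        return finding.angle in REVIEW_ANGLES
    return True
```

For example, `should_triage(Finding("review", "P1", angle=3))` is `False`, while the same finding with `deep=True` is `True`.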
Run these sequentially. Each round sees prior reasoning. Arbiter sees all rounds.
Round 1
Question: is the premise true?
Output: {round: 1, verdict: VALID|INVALID|UNCERTAIN, reasoning: "...", crux: "single key fact"}
Round 2
Question: are claimed defenses real and sufficient?
- For each claimed defense, Grep to find its actual implementation line.
- MAX_BUF_SIZE is not a verified bound; #define MAX_BUF_SIZE 64 is.
- If you cannot point to the line that enforces the defense, it does not exist.

Output: same JSON shape, with grep_used: true/false.
Round 3
Question: what did Rounds 1-2 not consider?
Output: same JSON shape.
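Because every round returns the same JSON shape, a caller can check a round's output before passing it on. This is a minimal sketch of such a check; `validate_round` is a hypothetical helper, and treating `grep_used` as mandatory only in Round 2 is an assumption drawn from the description above.

```python
ROUND_VERDICTS = {"VALID", "INVALID", "UNCERTAIN"}


def validate_round(result: dict) -> list:
    """Return a list of problems; an empty list means the output is well-formed."""
    problems = []
    if result.get("round") not in (1, 2, 3):
        problems.append("round must be 1, 2, or 3")
    if result.get("verdict") not in ROUND_VERDICTS:
        problems.append("verdict must be VALID, INVALID, or UNCERTAIN")
    for key in ("reasoning", "crux"):
        value = result.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(key + " must be a non-empty string")
    # Assumption: only Round 2 has to report whether Grep was used.
    if result.get("round") == 2 and not isinstance(result.get("grep_used"), bool):
        problems.append("round 2 must report grep_used as true/false")
    return problems
```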
Arbiter
Input: all 3 rounds + original finding/question + source code.
Question: final call. Which side has the stronger evidence?
- verdict: VALID|INVALID (no UNCERTAIN; make the call).
- crux: the key fact the verdict turns on.

Output:
{
"verdict": "VALID",
"crux": "memcpy at auth.c:142 copies network-controlled len bytes into 64-byte stack buffer with no bound check",
"reasoning": "Rounds 1 and 3 verified attacker reach; Round 2 found no size check in 50 LOC radius; arbiter confirms no caller clamps len."
}
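The sequencing described above (each round sees prior reasoning, the arbiter sees all rounds) might be wired up as follows. This is a sketch under stated assumptions: `ask` is a hypothetical stand-in for whatever sends a prompt to the model and returns parsed JSON, and the prompt layout is illustrative only.

```python
import json
from typing import Callable

ROUND_QUESTIONS = [
    "Is the premise true?",
    "Are claimed defenses real and sufficient?",
    "What did Rounds 1-2 not consider?",
]
ARBITER_QUESTION = "Final call: which side has the stronger evidence?"


def run_triage(finding: str, source: str, ask: Callable[[str], dict]) -> dict:
    """Run three sequential rounds, then the arbiter. `ask` is hypothetical."""
    rounds = []
    for number, question in enumerate(ROUND_QUESTIONS, start=1):
        prompt = json.dumps({
            "finding": finding,
            "source": source,
            "question": question,
            "prior_rounds": rounds,  # each round sees all earlier reasoning
        })
        result = dict(ask(prompt))
        result["round"] = number
        rounds.append(result)

    arbiter = dict(ask(json.dumps({
        "finding": finding,
        "source": source,
        "question": ARBITER_QUESTION,
        "rounds": rounds,  # the arbiter sees every round
    })))
    if arbiter.get("verdict") not in ("VALID", "INVALID"):
        raise ValueError("arbiter must return VALID or INVALID")
    return {"rounds": rounds, "arbiter": arbiter}
```

Injecting `ask` keeps the sketch testable without a model: a stub that returns canned verdicts is enough to exercise the flow.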
Burn these into every round's prompt:
#define / const declaration.
confidence = valid_rounds_before_arbiter / 3
- 100% (VVV): 3/3 rounds VALID. Arbiter rubber-stamps unless it finds something brand-new.
- 67% (VVI or VIV or IVV): majority VALID. Arbiter resolves the split with new evidence.
- 33% (IIV or IVI or VII): majority INVALID. Arbiter usually confirms INVALID.
- 0% (III): 3/3 INVALID. Arbiter rarely overrides.

The arbiter's verdict is final; confidence reflects the round vote for transparency. Record both in the output so humans can see where the arbiter diverged.
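The confidence formula is a simple vote count. A minimal sketch, assuming a round that returns UNCERTAIN counts as not VALID (the text does not say how UNCERTAIN rounds are scored); the function name is hypothetical.

```python
def confidence(round_verdicts):
    """Fraction of the three pre-arbiter rounds that voted VALID."""
    if len(round_verdicts) != 3:
        raise ValueError("expected exactly three round verdicts")
    # Assumption: UNCERTAIN is counted the same as INVALID here.
    valid_rounds = sum(1 for verdict in round_verdicts if verdict == "VALID")
    return valid_rounds / 3
```

With the rounds from the log example, `round(confidence(["VALID", "VALID", "INVALID"]), 2)` gives `0.67`.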
Once the arbiter returns:
| Arbiter verdict | Confidence | Severity action |
|---|---|---|
| VALID | ≥ 50% | Keep original severity |
| VALID | < 50% | Demote: P0→P1, P1→P2 |
| INVALID | any | Remove from gate tally, record as [FILTERED] in report for audit |
| UNCERTAIN (only if arbiter could not decide) | n/a | Keep original severity, flag for manual CTO review |
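The severity table translates directly into a lookup. In this sketch the function name and the returned dictionary keys are hypothetical, and representing a filtered finding with `final_severity` of `None` is an assumption, since the text does not say what value a filtered finding carries.

```python
DEMOTE = {"P0": "P1", "P1": "P2"}


def apply_verdict(original_severity, arbiter_verdict, confidence):
    """Map (arbiter verdict, confidence) to a severity action per the table."""
    if arbiter_verdict == "INVALID":
        # Removed from the gate tally but kept in the report for audit.
        return {"final_severity": None, "tag": "[FILTERED]", "manual_review": False}
    if arbiter_verdict == "UNCERTAIN":
        return {"final_severity": original_severity, "tag": None, "manual_review": True}
    if arbiter_verdict == "VALID":
        if confidence >= 0.5:
            final = original_severity
        else:
            # Severities outside P0/P1 are left unchanged.
            final = DEMOTE.get(original_severity, original_severity)
        return {"final_severity": final, "tag": None, "manual_review": False}
    raise ValueError("unknown arbiter verdict: " + repr(arbiter_verdict))
```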
Every caller logs triage results to .great_cto/triage-log.jsonl (append-only, one JSON per line):
{
"timestamp": "2026-04-19T12:34:56Z",
"caller": "review|security-officer|qa-engineer|architect",
"finding_id": "SEC-042",
"file": "src/auth.c:142",
"original_severity": "P0",
"rounds": [
{"round": 1, "verdict": "VALID", "crux": "..."},
{"round": 2, "verdict": "VALID", "crux": "...", "grep_used": true},
{"round": 3, "verdict": "INVALID", "crux": "..."}
],
"arbiter": {"verdict": "VALID", "crux": "..."},
"confidence": 0.67,
"final_severity": "P0"
}
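Appending to the log could look like the sketch below. The helper name is hypothetical, and filling in a missing `timestamp` with the current UTC time is an assumption. `json.dumps` escapes newlines inside strings, so each record stays on a single line.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_triage_record(record, log_path=".great_cto/triage-log.jsonl"):
    """Append one triage record as a single JSON line."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    # A timestamp already present in the record wins over the default.
    entry = {"timestamp": now, **record}
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry, separators=(",", ":")) + "\n")
```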
This log is how we measure whether triage earns its keep. Review it weekly:
# False-positive count: findings the arbiter ruled INVALID
# (-c prints one compact line per record, so wc -l counts records, not pretty-printed lines)
jq -c 'select(.arbiter.verdict=="INVALID")' .great_cto/triage-log.jsonl | wc -l
# Round agreement: distinct round verdicts per finding (1 = all three rounds agreed)
jq '[.rounds[].verdict] | unique | length' .great_cto/triage-log.jsonl
If the FP rate is below 10% after 50 triages, triage is filtering noise that wasn't there: lower the threshold or skip triage for that angle. If the FP rate is above 40%, the original review prompt is too trigger-happy: tighten the angle rules.
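The two thresholds can be checked mechanically against the log. This is a sketch with hypothetical names and advice labels; the "no-change" band between 10% and 40% and the refusal to advise below 50 records are assumptions read into the rule above.

```python
import json


def fp_report(lines, min_samples=50):
    """Return (fp_rate, advice) from an iterable of JSONL log lines."""
    records = [json.loads(line) for line in lines if line.strip()]
    if len(records) < min_samples:
        return None, "insufficient-data"
    flipped = sum(1 for r in records if r["arbiter"]["verdict"] == "INVALID")
    rate = flipped / len(records)
    if rate < 0.10:
        advice = "relax-triage"    # triage is filtering noise that wasn't there
    elif rate > 0.40:
        advice = "tighten-review"  # original review prompt is too trigger-happy
    else:
        advice = "no-change"       # assumption: the middle band needs no action
    return rate, advice
```

A weekly check could open `.great_cto/triage-log.jsonl` and pass the file object straight in, since iterating a file yields its lines.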
Per triaged finding: ~4 LLM turns (3 rounds + arbiter). At typical review sizes (~5-10 triaged findings per PR), total budget: 20-40 extra turns per /review. Batch when possible — one arbiter can handle multiple findings in a single call if their cruxes are independent.
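The budget arithmetic above can be written out as follows. A rough sketch: the function name is hypothetical, and it assumes batching only merges arbiter calls while each finding still gets its own three rounds.

```python
import math

ROUNDS_PER_FINDING = 3


def extra_turns(findings, arbiter_batch=1):
    """Estimate extra LLM turns for one /review run."""
    if findings < 0 or arbiter_batch < 1:
        raise ValueError("findings must be >= 0 and arbiter_batch >= 1")
    # Assumption: one arbiter call covers up to `arbiter_batch` findings.
    arbiter_calls = math.ceil(findings / arbiter_batch)
    return ROUNDS_PER_FINDING * findings + arbiter_calls
```

Unbatched, 5 and 10 findings give 20 and 40 turns, matching the range quoted above; with `arbiter_batch=5`, 10 findings drop to 32.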
For cost-sensitive runs (approval-level: auto on a huge PR), consider: triage only P0, leave P1 untriaged. Re-tune based on .great_cto/triage-log.jsonl data.
confidence (the vote) and final_verdict (the arbiter). Humans deserve to see the disagreement.
npx claudepluginhub avelikiy/great_cto