From mthines-agent-skills
Runs each PR finding through an independent confidence skill (not self-grading) and drops comments scoring below an 80% threshold to reduce noise.
Agent reference
mthines-agent-skills:agents/shared/rules/per-comment-confidence
The single LLM that wrote a finding is a poor judge of whether the finding is correct. The AAAI SELF-[IN]CORRECT result and Anthropic's own published guidance ("Pride and Prejudice", ACL 2024) both show that naïve self-grading either amplifies bias or adds no gain over self-consistency. The current 70 % per-comment threshold in the legacy reviewer was self-graded, exactly the failure mode the literature warns against.
After the rewrite, per-comment confidence is routed through the dedicated confidence skill, run in code mode, with an 80 % drop threshold.
For each finding that survives finding-grounding.md and the dedupe pass in rubric-composition.md:
1. Build a confidence(code) call with the finding as input:
   … <file:line>
2. Invoke Skill("confidence", "code").
3. If the Final score is < 80, drop the comment. Log the drop with the score.

| Threshold | Source | Outcome |
|---|---|---|
| 70 | legacy reviewer.md Step 5.4 | Targets the published industry mean for self-graded thresholds; produces a 5–15 % false-positive rate (Crash Override 2026 LLM security review prompt study) |
| 80 | Claude Code Review default; 2026 FindSkill.ai field comparison | Targets a false-positive rate below 5 %, the level at which developers still read every comment |
| 90 | Bito / Qodo enterprise tier defaults | Drops too many true positives at typical SOTA model output quality; reserve for high-stakes-only repos |
80 is the recommended setting. Repos with .review.yaml overrides can tune via:
per_comment_confidence_threshold: 85 # default 80
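The per-finding procedure and the threshold override can be sketched together as follows. This is a minimal sketch under stated assumptions: `Finding`, `resolve_threshold`, and `apply_confidence_gate` are hypothetical names, the `.review.yaml` file is assumed to be already parsed into a mapping, and `score_fn` stands in for the `Skill("confidence", "code")` call, whose real interface is not shown here.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Sequence

DEFAULT_THRESHOLD = 80  # default stated in this rule


@dataclass(frozen=True)
class Finding:
    """Hypothetical finding shape; the real one is not specified here."""
    location: str  # e.g. "src/app.py:42"
    message: str


def resolve_threshold(overrides: Mapping[str, object]) -> int:
    """Return the repo override if present and valid, else the default."""
    value = overrides.get("per_comment_confidence_threshold", DEFAULT_THRESHOLD)
    if isinstance(value, bool) or not isinstance(value, int) or not 0 <= value <= 100:
        raise ValueError(f"invalid per_comment_confidence_threshold: {value!r}")
    return value


def apply_confidence_gate(
    findings: Sequence[Finding],
    score_fn: Callable[[Finding], int],  # stand-in for the confidence skill
    threshold: int,
) -> tuple[list[Finding], list[tuple[Finding, int]]]:
    """Keep findings scoring >= threshold; record each drop with its score."""
    kept: list[Finding] = []
    drop_log: list[tuple[Finding, int]] = []
    for finding in findings:
        score = score_fn(finding)
        if score >= threshold:
            kept.append(finding)
        else:
            drop_log.append((finding, score))
    return kept, drop_log
```

A score exactly equal to the threshold is kept here, which matches the "< 80, drop" wording of the procedure.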
What confidence(code) returns

confidence(code) scores three dimensions, Correctness (40 %), Completeness (30 %), and No regressions (30 %), and returns one weighted Final score (see skills/quality/confidence/SKILL.md § For code mode).
The drop decision is on the Final score.
A finding whose Final is dragged below 80 by any dimension is noise — a claim that is correct but incomplete, or complete but wrong, does not help the author.
def passes_confidence(final_score: int) -> bool:
    # final_score = weighted average of Correctness (40%),
    # Completeness (30%), No-regressions (30%)
    return final_score >= 80
The acceptance-criteria questions in step 1 (accurate? actionable? helpful?) are the reviewer's rubric for framing the call — they are NOT scores the skill returns.
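As a worked example of how one weak dimension pulls the Final score under the threshold, the weighting described above can be sketched as follows. The function name `weighted_final` and the sample scores are illustrative assumptions; the actual skill may round or aggregate differently.

```python
# Weights as stated for code mode: Correctness 40 %, Completeness 30 %,
# No regressions 30 %. Integer percentages keep the arithmetic exact.
WEIGHTS = {"correctness": 40, "completeness": 30, "no_regressions": 30}


def weighted_final(scores: dict[str, int]) -> float:
    """Weighted average of the three dimension scores (each 0-100)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical finding: correct and safe, but incomplete.
example = {"correctness": 90, "completeness": 60, "no_regressions": 80}
final = weighted_final(example)  # (40*90 + 30*60 + 30*80) / 100 = 78.0
print(final, final >= 80)  # prints: 78.0 False
```

Two of the three dimensions score 80 or above, yet the finding is still dropped because the weighted Final lands at 78.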
… finding-grounding.md is designed to catch.
… rubric-composition.md before this step.
… comment-shape.md before this step.

The pipeline runs strictly left to right:
review pass
→ rubric-composition.md (dedupe + cap)
→ finding-grounding.md (claimed symbols exist?)
→ per-comment-confidence (Skill("confidence", "code") ≥ 80?)
→ conventional-comments.md (prefix prepend + decoration)
→ comment-shape.md (≤ 240 chars, ≤ 2 sentences, no structure?)
→ (PR Mode only) line-validity.md (hunk-bounds RIGHT-side check)
→ emit / post
Each step is a hard gate. A finding that fails any of them is dropped, with the drop logged in the terminal Quality Gate summary.
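The hard-gate behaviour can be sketched as an ordered list of predicates, where a finding stops at the first gate it fails. This is a minimal sketch: `run_gates`, the gate names, and the toy predicates are assumptions for illustration, not the agent's actual implementation, and set-level steps such as dedupe are left out because they are not per-finding checks.

```python
from collections import Counter
from typing import Callable, Sequence


def run_gates(
    findings: Sequence[dict],
    gates: Sequence[tuple[str, Callable[[dict], bool]]],
) -> tuple[list[dict], Counter]:
    """Run findings through gates in order; failing any gate is a drop.

    Returns the surviving findings and a per-gate count of drops.
    """
    drops: Counter = Counter()
    survivors: list[dict] = []
    for finding in findings:
        for name, passes in gates:
            if not passes(finding):
                drops[name] += 1  # counted for the Quality Gate summary
                break
        else:
            survivors.append(finding)  # passed every gate
    return survivors, drops


# Toy gates standing in for finding-grounding.md, the confidence call,
# and comment-shape.md; the real checks are defined by those rules.
gates = [
    ("grounding", lambda f: f["symbol_exists"]),
    ("confidence", lambda f: f["score"] >= 80),
    ("shape", lambda f: len(f["text"]) <= 240),
]
findings = [
    {"symbol_exists": True, "score": 91, "text": "nit: unused import"},
    {"symbol_exists": False, "score": 95, "text": "bug: foo() leaks"},
    {"symbol_exists": True, "score": 64, "text": "maybe rename this?"},
]
survivors, drops = run_gates(findings, gates)
print(len(survivors), dict(drops))  # prints: 1 {'grounding': 1, 'confidence': 1}
```

Because each finding breaks out at its first failure, a drop is attributed to exactly one gate, which keeps the per-gate counts additive.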
The Quality Gate summary in the agent's terminal output reports:
Quality Gate:
Findings produced: 24
Dedupe drops: 6
Grounding drops: 3
Confidence drops: 7 (avg score: 64)
Shape drops: 2
Final findings posted: 6
A run that posts 6 findings out of 24 produced is healthy. A run that posts 22 out of 24 is suspicious — the gates are not biting.
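The summary arithmetic and the health heuristic can be sketched as follows. The names `posted_count` and `gates_look_inactive`, and especially the 0.75 cutoff, are assumptions for illustration; the rule itself gives only the two example ratios (6 of 24 healthy, 22 of 24 suspicious), not a numeric cutoff.

```python
def posted_count(produced: int, drops: dict[str, int]) -> int:
    """Findings posted = findings produced minus every gate's drops."""
    posted = produced - sum(drops.values())
    if posted < 0:
        raise ValueError("drops exceed findings produced")
    return posted


def gates_look_inactive(produced: int, posted: int, cutoff: float = 0.75) -> bool:
    """Flag runs where nearly everything survives. The cutoff is hypothetical."""
    return produced > 0 and posted / produced > cutoff


# Counts taken from the example summary above.
drops = {"dedupe": 6, "grounding": 3, "confidence": 7, "shape": 2}
posted = posted_count(24, drops)
print(posted, gates_look_inactive(24, posted))  # prints: 6 False
print(gates_look_inactive(24, 22))              # prints: True
```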
npx claudepluginhub mthines/agent-skills