review-bench

Name: review-bench
Author: whichguy

Prompt/question A/B benchmarking and ablation tooling for review-suite.

npx claudepluginhub whichguy/claude-craft --plugin review-bench

Popularity

Stars: Above avg (Med: 0 · Avg: 285)

Installs: Med: 0 · Avg: 1

What's Inside

Agents (5)

compare-prompts-judge

/compare-prompts-judge

Pairwise LLM judge for compare-prompts skill. Evaluates two prompt outputs against the same input using 7 fixed qualitative criteria, derives a per-test winner by majority vote, and returns a structured JSON verdict with per-criterion scores. Spawned by compare-prompts skill for each test case. Receives PROMPT_A, PROMPT_B, INPUT, OUTPUT_A, and OUTPUT_B in its prompt. Returns JSON only — no prose. NOT for standalone use.
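The majority-vote step described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the plugin's implementation: the criterion names, the `"A"`/`"B"`/`"TIE"` labels, and the `majority_winner` helper are hypothetical. The description only states that there are 7 fixed criteria and that the per-test winner is derived by majority vote; treating equal counts as a tie is an assumption.

```python
from collections import Counter

def majority_winner(criterion_winners):
    """Return "A", "B", or "TIE" from a mapping of criterion -> per-criterion winner.

    Assumption: each criterion names "A", "B", or "TIE"; the side that wins
    more criteria wins the test case, and equal counts yield "TIE".
    """
    counts = Counter(criterion_winners.values())
    a, b = counts.get("A", 0), counts.get("B", 0)
    if a > b:
        return "A"
    if b > a:
        return "B"
    return "TIE"

# Hypothetical criterion names; the real seven are defined by the judge agent.
verdict = {
    "criterion_1": "A", "criterion_2": "A", "criterion_3": "B",
    "criterion_4": "A", "criterion_5": "TIE", "criterion_6": "B",
    "criterion_7": "A",
}
print(majority_winner(verdict))  # prints "A": A wins 4 of 7 criteria
```

With an odd number of criteria a two-way deadlock can only occur when at least one criterion is itself scored as a tie, which may be one reason for fixing the count at seven.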

compare-questions-judge

/compare-questions-judge

Pairwise LLM judge for compare-questions skill. Evaluates two revised plans (produced by applying two different planning questions to the same original plan) on 5 research-informed criteria from a staff engineer's perspective, derives winner by majority vote, self-checks trial validity (recusing if unfair/inconclusive), and returns structured JSON. Spawned by compare-questions skill for each plan test case. Receives ORIGINAL_PLAN, QUESTION_A, QUESTION_B, REVISION_A, REVISION_B in its prompt. Returns JSON only — no prose. NOT for standalone use.

prompt-comparator

/prompt-comparator

Compare prompts across quality, token efficiency, and output effectiveness.

**AUTOMATICALLY INVOKE** when user mentions:
- "compare prompts", "which prompt is better", "prompt efficiency"
- "A/B test prompts", "evaluate prompts", "test these prompts"
- Multiple prompt variations to choose between

**STRONGLY RECOMMENDED** for:
- Optimizing prompt quality
- Reducing token usage
- Comparing alternative approaches
- Before finalizing agent/skill prompts

question-bench-judge

/question-bench-judge

Position-blind comparative plan quality judge for question-bench skill. Compares two implementation plan versions on 8 fixed evaluation dimensions (Q-PQ1..Q-PQ8). Returns structured JSON only — no prose. Spawned by question-bench skill for each experiment comparison. Receives two plan versions as X/Y (labels randomized by skill). NOT for standalone use.
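A minimal sketch of the label randomization described above, assuming the calling skill keeps a private map from the blinded X/Y labels back to the real sides. The `blind_pair` and `unblind` helpers and the `"TIE"` pass-through are hypothetical; the description only says that the two plan versions arrive as X/Y with labels randomized by the skill.

```python
import random

def blind_pair(plan_a, plan_b, rng):
    """Randomly assign two plans to labels X and Y and return the mapping."""
    if rng.random() < 0.5:
        return {"X": plan_a, "Y": plan_b}, {"X": "A", "Y": "B"}
    return {"X": plan_b, "Y": plan_a}, {"X": "B", "Y": "A"}

def unblind(judge_winner, label_map):
    """Translate the judge's X/Y verdict back to A/B; pass other values through."""
    return label_map.get(judge_winner, judge_winner)

rng = random.Random(0)  # seeded only to make the example repeatable
shown, label_map = blind_pair("plan version A", "plan version B", rng)

# The judge only ever sees `shown`; the skill keeps `label_map` to itself.
print(unblind("X", label_map))
```

Because the judge never learns which side is which, a systematic preference for the first-listed plan averages out across test cases instead of favoring one variant.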

review-plan-ablation-judge

/review-plan-ablation-judge

Logical-equivalence judge for the review-plan ablation experiment. Compares two plan reviews — one from the structured question-based control, one from the directive-based ablated variant — on 5 criteria that measure whether the ablation silently drops, adds, or misweights issues relative to the control. Spawned by ablate-review-plan skill for each fixture. Receives CONTROL_REVIEW, ABLATED_REVIEW, and EXPECTED_FINDING (may be empty for real-world fixtures). Returns JSON only — no prose. NOT for standalone use.

Skills (10)

ablate-review-plan

/ablate-review-plan

Ablation test harness for the review-plan skill. Runs the structured question-based control (SKILL.md) and the directive-based ablated variant (SKILL-v-ablation-na.md, per-directive N/A semantics) against the same fixture set with k=3 repetition, then uses the review-plan-ablation-judge to compare outputs for logical equivalence and reports per-fixture stability. Answers: do the structured question IDs add meaningful signal, or does a directive prompt achieve equivalent issue detection?

Spot-Check 1 Pass Criterion (input3b: clean-plan calibration anchor):
- Majority winner = TIE (or SPLIT, treated as TIE)
- Mode verdict_agreement = EQUIVALENT
- Verdict stability = VERDICTS_STABLE (all 3 control runs same verdict AND all 3 ablated runs same verdict)
- Winner stability = WINNER_STABLE OR WINNER_UNSTABLE (the latter is acceptable iff verdict stability is VERDICTS_STABLE AND mode false_positives ∈ {EQUIVALENT, CONTROL}; surface it in the false-positive summary as a signal, not a fail)
- Mode false_positives ∈ {EQUIVALENT, CONTROL} (ablated must not over-flag more than control; false_positives = X means side X had more FPs, so the allowed set excludes the over-flagging side)
- >= 2 of 3 control AND >= 2 of 3 ablated runs reach PASS

Tolerates one stochastic flip per side; detects systematic over-flagging via the false_positives mode. The brittle "all 3 PASS" rule was replaced after the probe-9/input3 calibration repair (see RESULTS.md 2026-05-02 entries).

AUTOMATICALLY INVOKE when user mentions:
- "ablation test", "ablate review-plan", "test directive variant"
- "does the question structure matter", "directive vs questions"
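The Spot-Check 1 pass criterion above can be read as a single boolean predicate. The sketch below is one possible encoding, not the harness's actual code: the `summary` dict and its field names are hypothetical (they mirror the terms in the description, not a documented schema), and `"ABLATED"` as a false_positives value is an assumption inferred from "false_positives = X means side X had more FPs".

```python
def spot_check_passes(summary):
    """Evaluate the clean-plan spot check for one fixture (k=3 runs per side).

    Assumption: `summary` is a hypothetical per-fixture aggregate; none of
    these keys are confirmed by the plugin's documentation.
    """
    fp_ok = summary["mode_false_positives"] in {"EQUIVALENT", "CONTROL"}
    verdicts_stable = summary["verdict_stability"] == "VERDICTS_STABLE"
    # WINNER_UNSTABLE is tolerated only alongside stable verdicts and an
    # acceptable false-positive mode.
    winner_ok = summary["winner_stability"] == "WINNER_STABLE" or (
        summary["winner_stability"] == "WINNER_UNSTABLE"
        and verdicts_stable
        and fp_ok
    )
    return all([
        summary["majority_winner"] in {"TIE", "SPLIT"},  # SPLIT is treated as TIE
        summary["mode_verdict_agreement"] == "EQUIVALENT",
        verdicts_stable,
        winner_ok,
        fp_ok,
        summary["control_verdicts"].count("PASS") >= 2,
        summary["ablated_verdicts"].count("PASS") >= 2,
    ])

example = {
    "majority_winner": "TIE",
    "mode_verdict_agreement": "EQUIVALENT",
    "verdict_stability": "VERDICTS_STABLE",
    "winner_stability": "WINNER_UNSTABLE",
    "mode_false_positives": "EQUIVALENT",
    "control_verdicts": ["PASS", "PASS", "PASS"],
    "ablated_verdicts": ["PASS", "PASS", "PASS"],
}
print(spot_check_passes(example))  # True

# Same fixture, but the ablated variant over-flags relative to control.
over_flagging = dict(example, mode_false_positives="ABLATED")
print(spot_check_passes(over_flagging))  # False
```

Keeping each condition as its own entry in the `all([...])` list makes it straightforward to report which condition failed, should the harness need that.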

compare-prompts

/compare-prompts

Compare two prompt versions (A vs B) by running both against a directory of test input files, then evaluating results on three dimensions in priority order: quality > tokens > time.

**AUTOMATICALLY INVOKE** when user mentions:
- "compare prompts", "which prompt is better", "prompt efficiency"
- "A/B test prompts", "evaluate prompts", "test these prompts"
- Multiple prompt variations to choose between

**STRONGLY RECOMMENDED** for:
- Optimizing prompt quality
- Reducing token usage
- Comparing alternative approaches
- Before finalizing agent/skill prompts

Position bias is mitigated via randomized ordering per test case: the judge sees A/B in random order, and results are remapped before aggregation.
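The quality > tokens > time priority order can be modeled as a lexicographic comparison, where a later dimension is consulted only when every earlier one ties. The sketch below assumes that reading; the `pick_winner` helper, the field names, and the numeric scales are hypothetical and are not taken from the skill itself.

```python
def pick_winner(a, b):
    """Choose between two result summaries using quality > tokens > time.

    Assumption: higher `quality` is better, lower `tokens` and `seconds` are
    better, and a later dimension only breaks ties in the earlier ones.
    """
    # Negate quality so that "smaller tuple is better" holds for every field.
    key_a = (-a["quality"], a["tokens"], a["seconds"])
    key_b = (-b["quality"], b["tokens"], b["seconds"])
    if key_a == key_b:
        return "TIE"
    return "A" if key_a < key_b else "B"

a = {"quality": 5, "tokens": 1200, "seconds": 9.0}
b = {"quality": 5, "tokens": 900, "seconds": 14.0}
print(pick_winner(a, b))  # prints "B": quality ties, so the lower token count decides
```

Note that under this reading prompt A's faster run never matters, because the token comparison already settles the outcome.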

compare-questions

/compare-questions

A/B test two planning questions by applying each to one or more plans, judging which question produces a better plan revision. Priority chain: quality > input tokens (question size) > time.

AUTOMATICALLY INVOKE when user mentions:
- "compare questions", "which question is better", "A/B test questions"
- "evaluate questions", "test these questions against plans"
- "question efficiency", "question comparison"
- Two planning questions to compare against plan(s)

STRONGLY RECOMMENDED for:
- Optimizing review-plan question quality
- Reducing question token cost while maintaining effectiveness
- Choosing between alternative question phrasings
- Validating new questions against existing ones

Position bias is mitigated via randomized ordering per test case: the judge sees A/B in random order, and results are remapped before aggregation.

derive-questions

/derive-questions

Iteratively researches real software project failures and wins, extracts key planning questions via 5-whys analysis, validates them against synthetic test plans, judges their effectiveness via a parallel judge agent, refines them, and persists to a growing questions library. Builds a curated, language/system-agnostic question library that prevents known failure modes when applied during software planning. Supports multiple resumable runs: reads existing question inventory on startup, tracks researched failure domains, avoids near-duplicating existing questions.

AUTOMATICALLY INVOKE when user mentions:
- "derive questions", "derive planning questions", "research failure questions"
- "build question library", "question library from failures"
- "mine post-mortems", "extract planning questions"

improve-system-prompt

/improve-system-prompt

Benchmark and compare system prompt variants (V2/V2a/V2b/V2c) for Sheets Chat by running test scenarios through the real GAS-side ClaudeConversation pipeline. Tests both system-placement and user-placement, then evaluates with heuristic scoring (ABTestHarness) and LLM-as-judge.

**AUTOMATICALLY INVOKE** when:
- User says "benchmark system prompts", "compare prompt variants"
- User wants to evaluate placement (system vs user message)
- User says "which prompt variant is best", "run prompt benchmark"

**NOT for:** General prompt engineering, non-GAS prompts, one-off prompt writing. Use /optimize-system-prompt for editing/refining the active prompt.

Stats

Version: 0.1.0
Language: JavaScript
Stars: 2
Forks: 1
Maintenance: Good
License: MIT
Last Commit: May 3, 2026
Added: May 4, 2026

Available In

claude-craft (2)

Safety Signals

Caution

Uses power tools

Uses Bash, Write, or Edit tools

README

Claude Craft 🚀

Family-bundled Claude Code extensions, distributed as a plugin marketplace.

Eleven plugins covering Apps Script tooling, project wiki, plan/code review, prompt research bench, planning, async workflow, slides, and several domain-specific bundles. Install only what you need.

Install

/plugin marketplace add whichguy/claude-craft
/plugin install gas-suite@claude-craft         # pick the bundles you want
/plugin install review-suite@claude-craft
/plugin install wiki-suite@claude-craft
# … etc.

Verify with /plugin list.

Plugins

| Bundle | What it provides |
| --- | --- |
| `gas-suite` | Apps Script review, debugging, planning, sidebar testing, Gmail Cards |
| `wiki-suite` | Project LLM wiki: ingest, query, process queue, lint, proactive research |
| `review-suite` | Plan review, code review (Adversarial Auditor), iterative review-fix loop |
| `review-bench` | Prompt/question A/B benchmarking and ablation tooling (depends on review-suite) |
| `planning-suite` | Architect, refactor, test, schedule-plan-tasks, node-plan, alias/unalias, performance, knowledge |
| `async-suite` | Background task workflow: `/bg`, `/todo`, task-persist, feedback-collector |
| `slides-suite` | reveal.js or Google Slides decks |
| `comms` | Slack tagging |
| `form990` | IRS Form 990 preparation orchestrator |
| `plan-red-team` | Iterative red-team plan review with Opus orchestration |
| `local-classifier` | Local Ollama-powered prompt classifier UserPromptSubmit hook |

Cross-bundle dependency edges (declared in each plugin.json): gas-suite → review-suite, review-suite → wiki-suite, review-bench → review-suite, form990 → review-bench.
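To see which bundles a given plugin pulls in, the dependency edges above can be walked and topologically sorted. This is only an illustrative sketch: the edges are copied from the sentence above, but the `install_order` helper is hypothetical, and whether `/plugin install` resolves dependencies automatically is not stated here.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key depends on the plugins in its set (edges copied from the README).
DEPENDS_ON = {
    "gas-suite": {"review-suite"},
    "review-suite": {"wiki-suite"},
    "review-bench": {"review-suite"},
    "form990": {"review-bench"},
}

def install_order(target):
    """Return `target` and its transitive dependencies, dependencies first."""
    needed, stack = set(), [target]
    while stack:
        name = stack.pop()
        if name not in needed:
            needed.add(name)
            stack.extend(DEPENDS_ON.get(name, ()))
    # TopologicalSorter expects node -> predecessors, which matches DEPENDS_ON.
    graph = {name: DEPENDS_ON.get(name, set()) for name in needed}
    return list(TopologicalSorter(graph).static_order())

print(install_order("review-bench"))
# ['wiki-suite', 'review-suite', 'review-bench']
```

Under these edges, installing `form990` would involve four bundles in total, since it depends on `review-bench`, which in turn depends on `review-suite` and, through it, `wiki-suite`.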

Upgrading from < 1.0 (symlink-based install)

If you previously ran ./install.sh, run the one-shot cleanup once before adding the marketplace — it removes hook entries injected into ~/.claude/settings.json and unlinks dangling symlinks pointing into the repo:

git -C path/to/claude-craft pull
path/to/claude-craft/tools/migrate-from-symlinks.sh

Then proceed with the /plugin marketplace add step above.

Wiki System

Claude Craft includes a self-building wiki system that captures knowledge from your sessions and makes it available across conversations.

Wiki Skills

| Skill | Description |
| --- | --- |
| `/wiki-init` | Initialize a project wiki with directory structure and SCHEMA.md |
| `/wiki-ingest <source>` | Add a file or URL to the wiki (runs async in background) |
| `/wiki-query <question>` | Synthesize an answer from wiki pages with citations |
| `/wiki-load <topic>` | Load raw wiki pages into context (no synthesis overhead) |
| `/wiki-process` | Process pending queue entries: the self-building engine |
| `/wiki-lint` | Health check: find orphans, broken links, contradictions, stale pages |

Prompt Improvement Skills

A consolidated set of skills for iterating on prompts, system prompts, and evaluator questions.

| Skill | Description |
| --- | --- |
| `/improve-prompt` | Research-backed iterative prompt improvement loop with experiment variants, scope-preservation gate, and questions-based judging. Subsumes `/prompt-critique` (via `--mode critique`) and `/prompt-probes` (via `--with-probes`). |
| `/compare-prompts` | A/B test two prompts with execution-based scoring. Standalone harness. |
| `/process-feedback` | Ingest the `feedback-collector` plugin's backlog and propose surgical prompt updates (propose-only; never auto-edits SKILL.md). |
| `/optimize-system-prompt` | Optimize/refine the GAS Sheets Chat system prompt (compression + refinement). Subsumes `/ideate-system-prompt` via `--mode ideate` (autonomous hypothesis generation + benchmarking). |
| `/improve-system-prompt` | Benchmark pre-coded GAS system prompt variants (V2/V2a/V2b/V2c) against scenarios. Sibling of `/optimize-system-prompt` for projects with predefined variants. |
| `/derive-questions` | Mine failures and extract evaluator questions from real runs. |
| `/optimize-questions` | Token-efficiency optimization for plan-review questions. Uses `/compare-questions` as its internal A/B engine. |
| `/compare-questions` | Pairwise A/B testing of two evaluator questions against plan fixtures. |

Wiki Plugin (wiki-hooks)

The wiki-hooks plugin provides 13 lifecycle handlers + a shared library that run automatically:

View full README on GitHub

Similar Plugins

caveman

ui-design

llm-council-plugin

self-improving-agent

claude-mem

antigravity-bundle-web-designer

More by whichguy

planning-suite

wiki-suite

review-suite

gas-suite

async-suite
