By whichguy
Prompt/question A/B benchmarking and ablation tooling for review-suite.
Pairwise LLM judge for compare-prompts skill. Evaluates two prompt outputs against the same input using 7 fixed qualitative criteria, derives a per-test winner by majority vote, and returns a structured JSON verdict with per-criterion scores. Spawned by compare-prompts skill for each test case. Receives PROMPT_A, PROMPT_B, INPUT, OUTPUT_A, and OUTPUT_B in its prompt. Returns JSON only — no prose. NOT for standalone use.
Pairwise LLM judge for compare-questions skill. Evaluates two revised plans (produced by applying two different planning questions to the same original plan) on 5 research-informed criteria from a staff engineer's perspective, derives winner by majority vote, self-checks trial validity (recusing if unfair/inconclusive), and returns structured JSON. Spawned by compare-questions skill for each plan test case. Receives ORIGINAL_PLAN, QUESTION_A, QUESTION_B, REVISION_A, REVISION_B in its prompt. Returns JSON only — no prose. NOT for standalone use.
Compare prompts across quality, token efficiency, and output effectiveness. **AUTOMATICALLY INVOKE** when user mentions: - "compare prompts", "which prompt is better", "prompt efficiency" - "A/B test prompts", "evaluate prompts", "test these prompts" - Multiple prompt variations to choose between **STRONGLY RECOMMENDED** for: - Optimizing prompt quality - Reducing token usage - Comparing alternative approaches - Before finalizing agent/skill prompts
Position-blind comparative plan quality judge for question-bench skill. Compares two implementation plan versions on 8 fixed evaluation dimensions (Q-PQ1..Q-PQ8). Returns structured JSON only — no prose. Spawned by question-bench skill for each experiment comparison. Receives two plan versions as X/Y (labels randomized by skill). NOT for standalone use.
Logical-equivalence judge for the review-plan ablation experiment. Compares two plan reviews — one from the structured question-based control, one from the directive-based ablated variant — on 5 criteria that measure whether the ablation silently drops, adds, or misweights issues relative to the control. Spawned by ablate-review-plan skill for each fixture. Receives CONTROL_REVIEW, ABLATED_REVIEW, and EXPECTED_FINDING (may be empty for real-world fixtures). Returns JSON only — no prose. NOT for standalone use.
Ablation test harness for the review-plan skill. Runs the structured question-based control (SKILL.md) and the directive-based ablated variant (SKILL-v-ablation-na.md, per-directive N/A semantics) against the same fixture set with k=3 repetition, then uses the review-plan-ablation-judge to compare outputs for logical equivalence and reports per-fixture stability. Answers: do the structured question IDs add meaningful signal, or does a directive prompt achieve equivalent issue detection? Spot-Check 1 Pass Criterion (input3b — clean-plan calibration anchor): - Majority winner = TIE (or SPLIT, treated as TIE) - Mode verdict_agreement = EQUIVALENT - Verdict stability = VERDICTS_STABLE (all 3 control runs same verdict AND all 3 ablated runs same verdict) - Winner stability = WINNER_STABLE OR WINNER_UNSTABLE (the latter is acceptable iff verdict-stability is VERDICTS_STABLE AND mode false_positives ∈ {EQUIVALENT, CONTROL}; surface in the false-positive summary as a signal, not a fail) - Mode false_positives in {EQUIVALENT, CONTROL} (ablated must not over-flag more than control; false_positives = X means side X had more FPs, so the allowed set excludes the over-flagging side) - >= 2 of 3 control AND >= 2 of 3 ablated runs reach PASS Tolerates one stochastic flip per side; detects systematic over-flagging via the false_positives mode. Brittle "all 3 PASS" rule replaced after probe-9/input3 calibration repair (see RESULTS.md 2026-05-02 entries). AUTOMATICALLY INVOKE when user mentions: - "ablation test", "ablate review-plan", "test directive variant" - "does the question structure matter", "directive vs questions"
Compare two prompt versions (A vs B) by running both against a directory of test input files, then evaluating results on three dimensions in priority order: quality > tokens > time. **AUTOMATICALLY INVOKE** when user mentions: - "compare prompts", "which prompt is better", "prompt efficiency" - "A/B test prompts", "evaluate prompts", "test these prompts" - Multiple prompt variations to choose between **STRONGLY RECOMMENDED** for: - Optimizing prompt quality - Reducing token usage - Comparing alternative approaches - Before finalizing agent/skill prompts Position bias mitigated via randomized ordering per test case — judge sees A/B in random order, results remapped before aggregation.
A/B test two planning questions by applying each to one or more plans, judging which question produces a better plan revision. Priority chain: quality > input tokens (question size) > time. AUTOMATICALLY INVOKE when user mentions: - "compare questions", "which question is better", "A/B test questions" - "evaluate questions", "test these questions against plans" - "question efficiency", "question comparison" - Two planning questions to compare against plan(s) STRONGLY RECOMMENDED for: - Optimizing review-plan question quality - Reducing question token cost while maintaining effectiveness - Choosing between alternative question phrasings - Validating new questions against existing ones Position bias mitigated via randomized ordering per test case — judge sees A/B in random order, results remapped before aggregation.
Iteratively researches real software project failures and wins, extracts key planning questions via 5-whys analysis, validates them against synthetic test plans, judges their effectiveness via a parallel judge agent, refines them, and persists to a growing questions library. Builds a curated, language/system-agnostic question library that prevents known failure modes when applied during software planning. Supports multiple resumable runs: reads existing question inventory on startup, tracks researched failure domains, avoids near-duplicating existing questions. AUTOMATICALLY INVOKE when user mentions: - "derive questions", "derive planning questions", "research failure questions" - "build question library", "question library from failures" - "mine post-mortems", "extract planning questions"
Benchmark and compare system prompt variants (V2/V2a/V2b/V2c) for Sheets Chat by running test scenarios through the real GAS-side ClaudeConversation pipeline. Tests both system-placement and user-placement, then evaluates with heuristic scoring (ABTestHarness) and LLM-as-judge. **AUTOMATICALLY INVOKE** when: - User says "benchmark system prompts", "compare prompt variants" - User wants to evaluate placement (system vs user message) - User says "which prompt variant is best", "run prompt benchmark" **NOT for:** General prompt engineering, non-GAS prompts, one-off prompt writing. Use /optimize-system-prompt for editing/refining the active prompt.
Uses power tools
Uses Bash, Write, or Edit tools
Own this plugin?
Verify ownership to unlock analytics, metadata editing, and a verified badge. GitHub access is read-only (username + org membership).
Sign in to claimOwn this plugin?
Verify ownership to unlock analytics, metadata editing, and a verified badge. GitHub access is read-only (username + org membership).
Sign in to claimBased on adoption, maintenance, documentation, and repository signals. Not a security audit or endorsement.
Family-bundled Claude Code extensions, distributed as a plugin marketplace.
Eleven plugins covering Apps Script tooling, project wiki, plan/code review, prompt research bench, planning, async workflow, slides, and several domain-specific bundles. Install only what you need.
/plugin marketplace add whichguy/claude-craft
/plugin install gas-suite@claude-craft # pick the bundles you want
/plugin install review-suite@claude-craft
/plugin install wiki-suite@claude-craft
# … etc.
Verify with /plugin list.
| Bundle | What it provides |
|---|---|
gas-suite | Apps Script review, debugging, planning, sidebar testing, Gmail Cards |
wiki-suite | Project LLM wiki: ingest, query, process queue, lint, proactive research |
review-suite | Plan review, code review (Adversarial Auditor), iterative review-fix loop |
review-bench | Prompt/question A/B benchmarking and ablation tooling (depends on review-suite) |
planning-suite | Architect, refactor, test, schedule-plan-tasks, node-plan, alias/unalias, performance, knowledge |
async-suite | Background task workflow: /bg, /todo, task-persist, feedback-collector |
slides-suite | reveal.js or Google Slides decks |
comms | Slack tagging |
form990 | IRS Form 990 preparation orchestrator |
plan-red-team | Iterative red-team plan review with Opus orchestration |
local-classifier | Local Ollama-powered prompt classifier UserPromptSubmit hook |
Cross-bundle dependency edges (declared in each plugin.json):
gas-suite → review-suite, review-suite → wiki-suite,
review-bench → review-suite, form990 → review-bench.
If you previously ran ./install.sh, run the one-shot cleanup once before
adding the marketplace — it removes hook entries injected into
~/.claude/settings.json and unlinks dangling symlinks pointing into the
repo:
git -C path/to/claude-craft pull
path/to/claude-craft/tools/migrate-from-symlinks.sh
Then proceed with the /plugin marketplace add step above.
Claude Craft includes a self-building wiki system that captures knowledge from your sessions and makes it available across conversations.
| Skill | Description |
|---|---|
/wiki-init | Initialize a project wiki with directory structure and SCHEMA.md |
/wiki-ingest <source> | Add a file or URL to the wiki (runs async in background) |
/wiki-query <question> | Synthesize an answer from wiki pages with citations |
/wiki-load <topic> | Load raw wiki pages into context (no synthesis overhead) |
/wiki-process | Process pending queue entries — the self-building engine |
/wiki-lint | Health check: find orphans, broken links, contradictions, stale pages |
A consolidated set of skills for iterating on prompts, system prompts, and evaluator questions.
| Skill | Description |
|---|---|
/improve-prompt | Research-backed iterative prompt improvement loop with experiment variants, scope-preservation gate, and questions-based judging. Subsumes /prompt-critique (via --mode critique) and /prompt-probes (via --with-probes). |
/compare-prompts | A/B test two prompts with execution-based scoring. Standalone harness. |
/process-feedback | Ingest the feedback-collector plugin's backlog and propose surgical prompt updates (propose-only — never auto-edits SKILL.md). |
/optimize-system-prompt | Optimize/refine the GAS Sheets Chat system prompt (compression + refinement). Subsumes /ideate-system-prompt via --mode ideate (autonomous hypothesis generation + benchmarking). |
/improve-system-prompt | Benchmark pre-coded GAS system prompt variants (V2/V2a/V2b/V2c) against scenarios. Sibling of /optimize-system-prompt for projects with predefined variants. |
/derive-questions | Mine failures and extract evaluator questions from real runs. |
/optimize-questions | Token-efficiency optimization for plan-review questions. Uses /compare-questions as its internal A/B engine. |
/compare-questions | Pairwise A/B testing of two evaluator questions against plan fixtures. |
The wiki-hooks plugin provides 13 lifecycle handlers + a shared library that run automatically:
npx claudepluginhub whichguy/claude-craft --plugin review-benchArchitect, refactor, test, schedule-plan-tasks, node-plan, alias/unalias, performance, knowledge.
Project LLM wiki: ingest, query, process queue, lint, plus proactive research hook.
Plan review, code review (Adversarial Auditor), review-fix, security/red-team, memory audits.
Google Apps Script review, debugging, planning, sidebar testing, Gmail Cards.
Background task workflow: bg / todo / todo-cleanup + task-persist + feedback ingestion.
Ultra-compressed communication mode. Cuts ~75% of tokens while keeping full technical accuracy by speaking like a caveman.
Comprehensive UI/UX design plugin for mobile (iOS, Android, React Native) and web applications with design systems, accessibility, and modern patterns
Multi-model consensus engine integrating OpenAI Codex CLI, Gemini CLI, and Claude CLI for collaborative code review and problem-solving.
Curate auto-memory, promote learnings to CLAUDE.md and rules, extract proven patterns into reusable skills.
Memory compression system for Claude Code - persist context across sessions
Editorial "Web Designer" bundle for Claude Code from Antigravity Awesome Skills.