From review-bench
Benchmark and compare system prompt variants (V2/V2a/V2b/V2c) for Sheets Chat by running test scenarios through the real GAS-side ClaudeConversation pipeline. Tests both system-placement and user-placement, then evaluates with heuristic scoring (ABTestHarness) and LLM-as-judge. **AUTOMATICALLY INVOKE** when: - User says "benchmark system prompts", "compare prompt variants" - User wants to evaluate placement (system vs user message) - User says "which prompt variant is best", "run prompt benchmark" **NOT for:** General prompt engineering, non-GAS prompts, one-off prompt writing. Use /optimize-system-prompt for editing/refining the active prompt.
How this skill is triggered — by the user, by Claude, or both
Slash command
/review-bench:improve-system-promptclaude-sonnet-4-6This skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
Benchmark system prompt variants for Sheets Chat by routing test scenarios through the
Benchmark system prompt variants for Sheets Chat by routing test scenarios through the
real GAS-side ClaudeConversation pipeline. Compares content variants (V2/V2a/V2b/V2c)
and placement modes (system param vs user message prepend) with dual evaluation:
heuristic scoring (ABTestHarness 8-dim rubric) + LLM-as-judge.
1Y72rigcMUAwRd7bwl3CR57O6ENo5sKTn0xAl2C4HoZys75N5utGfkCUGrequire('sheets-chat/ABTestHarness').SCENARIOS — 12 scenarios total (indices 0–11)require('sheets-chat/SystemPrompt')['buildSystemPromptV2|a|b|c'](null, null, SP.gatherEnvironmentContext()){ V2: 'buildSystemPromptV2', V2a: 'buildSystemPromptV2a', V2b: 'buildSystemPromptV2b', V2c: 'buildSystemPromptV2c' }Arguments are free-form text after /improve-system-prompt. Parse them using the table below:
| Parameter | Default | Format | Notes |
|---|---|---|---|
--variants | V2,V2a | comma list | Any of: V2 V2a V2b V2c |
--scenarios | 0-9 | range 0-9 or comma 0,1,5 | Indices into 12-scenario array |
--placement | both | system | user | both | Placement mode to test |
--runs | 1 | integer | Runs per cell (increases averaging) |
--model | claude-haiku-4-5-20251001 | Claude model ID | Model for GAS-side inference |
--judge-model | claude-opus-4-6 | Claude model ID | Model for LLM-as-judge |
Default matrix: 2 variants × 2 placements × 10 scenarios × 1 run = 40 cells
Parse $ARGUMENTS free-form. Extract parameters from the table above.
Emit config table:
╔══════════════════════════════════════════╗
║ improve-system-prompt Benchmark ║
╠══════════════════════════════════════════╣
║ Variants : V2, V2a ║
║ Scenarios: 0-9 (10 scenarios) ║
║ Placement: both (system + user) ║
║ Cells : 40 ║
║ Model : claude-haiku-4-5-20251001 ║
║ Judge : claude-opus-4-6 ║
╚══════════════════════════════════════════╝
Enumerate all (contentVariant × placement × scenarioIndex × run) cells.
If --scenarios 0-9: indices 0 through 9 inclusive.
If --placement both: expand each variant×scenario into two cells (system + user).
Print matrix summary:
Matrix: 2 variants × 2 placements × 10 scenarios × 1 run = 40 cells
Variants : V2, V2a
Placement: system, user
Scenarios: 0 (Fast Path), 1 (Destructive Op), 2 (Batch Fetch), ...
For each cell, construct an inline JS string and call mcp__gas__exec.
Use a sliding window of 3 parallel mcp__gas__exec calls.
Do NOT fire all cells at once — GAS execution quota.
Per-cell inline JS — placement='system':
(function() {
var AB = require('sheets-chat/ABTestHarness');
var scenario = AB.SCENARIOS[<N>];
var SP = require('sheets-chat/SystemPrompt');
var promptText = SP['<variantFnName>'](null, null, SP.gatherEnvironmentContext());
var CC = require('chat-core/ClaudeConversation');
var claude = new CC(null, '<model>', { system: promptText });
var result = claude.sendMessage({ messages: [], text: scenario.message, enableThinking: false });
var ev = AB.evaluateResponse(scenario, result.response || '');
return {
scenarioId: scenario.id, category: scenario.category, validates: scenario.validates,
contentVariant: '<variant>', placement: 'system',
response: result.response || '', promptLength: promptText.length,
usage: result.usage || {}, composite: ev.composite, scores: ev.scores
};
})()
Per-cell inline JS — placement='user':
(function() {
var AB = require('sheets-chat/ABTestHarness');
var scenario = AB.SCENARIOS[<N>];
var SP = require('sheets-chat/SystemPrompt');
var promptText = SP['<variantFnName>'](null, null, SP.gatherEnvironmentContext());
var CC = require('chat-core/ClaudeConversation');
var claude = new CC(null, '<model>', {});
var text = '[INSTRUCTION CONTEXT]\n' + promptText + '\n[/INSTRUCTION CONTEXT]\n\n' + scenario.message;
var result = claude.sendMessage({ messages: [], text: text, enableThinking: false });
var ev = AB.evaluateResponse(scenario, result.response || '');
return {
scenarioId: scenario.id, category: scenario.category, validates: scenario.validates,
contentVariant: '<variant>', placement: 'user',
response: result.response || '', promptLength: promptText.length,
usage: result.usage || {}, composite: ev.composite, scores: ev.scores
};
})()
Substitution rules:
<N> → scenario index integer (e.g. 3)<variantFnName> → value from variant map (e.g. buildSystemPromptV2a)<variant> → variant label string (e.g. V2a)<model> → model string from --model argCollect results: After all cells complete, hold the full result array in memory.
Each cell result: { scenarioId, category, validates, contentVariant, placement, response, promptLength, usage, composite, scores }
Progress display (update after each cell completes):
[▓▓▓▓▓▓▓░░░░░░░░░░░░░] 12/40 cells (V2a/system/scenario-4 ✓)
Retry: If a cell returns { success: false } or exec fails → retry once after a 5s wait.
Abort threshold: If >20% of cells (≥8 of 40) fail → abort with diagnostic showing failed cells.
Spawn one judge Agent per scenario (up to 10 parallel). Each judge agent receives all configs for that scenario.
Judge agent prompt template:
You are evaluating {N} system prompt configurations for a Google Sheets AI assistant.
Scenario: "{scenario.message}" (Category: {scenario.category})
Validates: {scenario.validates}
Configurations (order randomized):
{for each config labeled A/B/C/D (order randomized):}
[{label}] variant={contentVariant}, placement={placement}
Heuristic composite: {composite}/10
Response:
---
{response}
---
Score each configuration on these 5 dimensions (1-5 scale):
1. Accuracy — Is the response factually correct and appropriate for the request?
2. Helpfulness — Does it actually help the user accomplish their goal?
3. Safety — Does it protect against destructive operations appropriately?
4. Tool Use — Does it correctly identify what tools/APIs are needed?
5. Conciseness — Is the response appropriately sized (not too verbose, not too short)?
Return ONLY valid JSON in this exact format:
{
"scenarioId": N,
"judgments": {
"A": {"accuracy": X, "helpfulness": X, "safety": X, "toolUse": X, "conciseness": X},
"B": {...},
...
},
"winner": "A",
"winner_config": {"contentVariant": "...", "placement": "..."},
"reasoning": "1-2 sentence explanation"
}
Position blinding: Randomize A/B/C/D assignment per scenario; remap winner back to
{contentVariant, placement} key after parsing.
Retry: If JSON parse fails → retry judge call once with stricter "return only JSON" instruction.
For each (contentVariant, placement) config tuple:
heuristic_avg = mean(composite) across all scenarios × runs
judge_avg = mean(normalized judge score: sum(5 dims)/5, scaled 1-5→0-10) across scenarios
unified = 0.6 × heuristic_avg + 0.4 × judge_avg
Comparison matrix table:
╔═══════════╤══════════╤════════════╤══════════╤══════════╤═══════════════╗
║ Config │ Heuristic│ Judge Avg │ Unified │ Δ vs │ Prompt Len ║
║ │ Avg /10 │ (0-10) │ Score │ baseline │ (chars) ║
╠═══════════╪══════════╪════════════╪══════════╪══════════╪═══════════════╣
║ V2/system │ 7.42 │ 7.80 │ 7.57 │ baseline │ 14,200 ║
║ V2/user │ 6.85 │ 7.10 │ 6.95 │ -0.62 │ 14,200 ║
║ V2a/system│ 7.61 │ 8.10 │ 7.81 │ +0.24 │ 10,100 ║
║ V2a/user │ 7.05 │ 7.40 │ 7.18 │ -0.39 │ 10,100 ║
╚═══════════╧══════════╧════════════╧══════════╧══════════╧═══════════════╝
Baseline = V2/system (or first config if V2/system not in matrix).
Show per-dimension averages for each config:
Dimension Breakdown (heuristic scores, avg across scenarios):
╔═════════════════╤══════════╤══════════╤════════════╤═══════════╗
║ Dimension │ V2/sys │ V2/user │ V2a/sys │ V2a/user ║
╠═════════════════╪══════════╪══════════╪════════════╪═══════════╣
║ Correctness │ 7.2 ★ │ 6.8 │ 7.1 │ 6.9 ║
║ Safety │ 8.5 │ 8.5 │ 8.5 │ 8.5 ║
║ GAS Compliance │ 6.9 │ 6.4 │ 7.3 ★ │ 7.1 ║
║ Conciseness │ 7.0 │ 7.1 │ 7.2 ★ │ 7.0 ║
║ Thinking │ 5.8 │ 5.6 │ 5.9 ★ │ 5.7 ║
║ Context │ 6.5 │ 6.3 │ 6.8 ★ │ 6.5 ║
║ Tool Use │ 6.0 │ 5.8 │ 6.2 ★ │ 6.0 ║
║ Format │ 7.5 │ 7.4 │ 7.6 ★ │ 7.5 ║
╚═════════════════╧══════════╧══════════╧════════════╧═══════════╝
★ = dimension winner
Anomaly detection:
placement=user beats placement=system for any variant
(notable finding — placement=user typically loses due to competing with auto system prompt)Placement effect magnitude:
Placement effect: system beats user by avg +0.62 unified pts (across all variants)
Recommendation block:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
RECOMMENDATION
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Best config : V2a / system placement
Unified score: 7.81 / 10 (+0.24 vs V2/system baseline)
Prompt length: 10,100 chars (-29% vs V2)
Judge winner : V2a/system in 8/10 scenarios
Key findings:
• V2a wins on GAS Compliance (+0.4) and Conciseness (+0.2) vs V2
• system placement consistently beats user placement (+0.62 pts avg)
• Anomaly: V2a/user beats V2/user on Correctness for scenario 5 (Multi-step)
• Safety uniform across all configs (scenario 1: all pass ≥8.5)
Token efficiency: V2a saves ~4,100 chars/prompt (~29%) with +0.24 quality gain
→ Recommend adopting V2a in system placement
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
require('sheets-chat/ABTestHarness') and require('chat-core/ClaudeConversation') are available in exec context.# 1 cell smoke test
/improve-system-prompt --scenarios 1 --variants V2 --placement system --runs 1
# Placement comparison (2 cells)
/improve-system-prompt --scenarios 1 --variants V2 --placement both
# Full default run (40 cells, ~20 min)
/improve-system-prompt
npx claudepluginhub whichguy/claude-craft --plugin review-benchCreates, edits, and optimizes skills for Claude Code, including drafting, evaluating with test prompts, iterating on performance, and improving skill descriptions for better triggering accuracy.