Skill

improve-system-prompt

Benchmark and compare system prompt variants (V2/V2a/V2b/V2c) for Sheets Chat by running test scenarios through the real GAS-side ClaudeConversation pipeline. Tests both system-placement and user-placement, then evaluates with heuristic scoring (ABTestHarness) and LLM-as-judge. **AUTOMATICALLY INVOKE** when: - User says "benchmark system prompts", "compare prompt variants" - User wants to evaluate placement (system vs user message) - User says "which prompt variant is best", "run prompt benchmark" **NOT for:** General prompt engineering, non-GAS prompts, one-off prompt writing. Use /optimize-system-prompt for editing/refining the active prompt.

Popularity

Parent stars

Parent forks

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/review-bench:improve-system-prompt

User invocable

Model invocable

Inline context

Default effort

Configuration

Modelclaude-sonnet-4-6

Tool Access

This skill is limited to the following tools:

AgentTaskTaskCreateTaskGetTaskListTaskUpdateTaskStopTaskOutputBashReadGlobWritemcp__gas__execmcp__gas__lsmcp__gas__status

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Benchmark system prompt variants for Sheets Chat by routing test scenarios through the

SKILL.md

314 lines · ~3.1k tokens

Stats

LanguageJavaScript

Parent stars2

Parent forks1

MaintenanceGood

Last CommitMay 3, 2026

Actions

View Source View Plugin View on GitHub View README

Stats

Actions

improve-system-prompt Skill

Benchmark system prompt variants for Sheets Chat by routing test scenarios through the real GAS-side ClaudeConversation pipeline. Compares content variants (V2/V2a/V2b/V2c) and placement modes (system param vs user message prepend) with dual evaluation: heuristic scoring (ABTestHarness 8-dim rubric) + LLM-as-judge.

Project Context

ScriptId: 1Y72rigcMUAwRd7bwl3CR57O6ENo5sKTn0xAl2C4HoZys75N5utGfkCUG
GAS execution: inline JS via mcp__gas__exec — no module to deploy
Scenarios: require('sheets-chat/ABTestHarness').SCENARIOS — 12 scenarios total (indices 0–11)
Variants: require('sheets-chat/SystemPrompt')['buildSystemPromptV2|a|b|c'](null, null, SP.gatherEnvironmentContext())
Variant map: { V2: 'buildSystemPromptV2', V2a: 'buildSystemPromptV2a', V2b: 'buildSystemPromptV2b', V2c: 'buildSystemPromptV2c' }

Argument Reference

Arguments are free-form text after /improve-system-prompt. Parse them using the table below:

Parameter	Default	Format	Notes
`--variants`	`V2,V2a`	comma list	Any of: V2 V2a V2b V2c
`--scenarios`	`0-9`	range `0-9` or comma `0,1,5`	Indices into 12-scenario array
`--placement`	`both`	`system` \| `user` \| `both`	Placement mode to test
`--runs`	`1`	integer	Runs per cell (increases averaging)
`--model`	`claude-haiku-4-5-20251001`	Claude model ID	Model for GAS-side inference
`--judge-model`	`claude-opus-4-6`	Claude model ID	Model for LLM-as-judge

Default matrix: 2 variants × 2 placements × 10 scenarios × 1 run = 40 cells

Step 0 — Parse & Config Banner

Parse $ARGUMENTS free-form. Extract parameters from the table above.

Emit config table:

╔══════════════════════════════════════════╗
║     improve-system-prompt Benchmark      ║
╠══════════════════════════════════════════╣
║  Variants : V2, V2a                      ║
║  Scenarios: 0-9 (10 scenarios)           ║
║  Placement: both (system + user)         ║
║  Cells    : 40                           ║
║  Model    : claude-haiku-4-5-20251001    ║
║  Judge    : claude-opus-4-6              ║
╚══════════════════════════════════════════╝

Step 1 — Build Matrix

Enumerate all (contentVariant × placement × scenarioIndex × run) cells. If --scenarios 0-9: indices 0 through 9 inclusive. If --placement both: expand each variant×scenario into two cells (system + user).

Print matrix summary:

Matrix: 2 variants × 2 placements × 10 scenarios × 1 run = 40 cells
  Variants : V2, V2a
  Placement: system, user
  Scenarios: 0 (Fast Path), 1 (Destructive Op), 2 (Batch Fetch), ...

Step 2 — Execute Cells (Sliding Window, 3 Parallel)

For each cell, construct an inline JS string and call mcp__gas__exec. Use a sliding window of 3 parallel mcp__gas__exec calls. Do NOT fire all cells at once — GAS execution quota.

Per-cell inline JS — placement='system':

(function() {
  var AB = require('sheets-chat/ABTestHarness');
  var scenario = AB.SCENARIOS[<N>];
  var SP = require('sheets-chat/SystemPrompt');
  var promptText = SP['<variantFnName>'](null, null, SP.gatherEnvironmentContext());
  var CC = require('chat-core/ClaudeConversation');
  var claude = new CC(null, '<model>', { system: promptText });
  var result = claude.sendMessage({ messages: [], text: scenario.message, enableThinking: false });
  var ev = AB.evaluateResponse(scenario, result.response || '');
  return {
    scenarioId: scenario.id, category: scenario.category, validates: scenario.validates,
    contentVariant: '<variant>', placement: 'system',
    response: result.response || '', promptLength: promptText.length,
    usage: result.usage || {}, composite: ev.composite, scores: ev.scores
  };
})()

Per-cell inline JS — placement='user':

(function() {
  var AB = require('sheets-chat/ABTestHarness');
  var scenario = AB.SCENARIOS[<N>];
  var SP = require('sheets-chat/SystemPrompt');
  var promptText = SP['<variantFnName>'](null, null, SP.gatherEnvironmentContext());
  var CC = require('chat-core/ClaudeConversation');
  var claude = new CC(null, '<model>', {});
  var text = '[INSTRUCTION CONTEXT]\n' + promptText + '\n[/INSTRUCTION CONTEXT]\n\n' + scenario.message;
  var result = claude.sendMessage({ messages: [], text: text, enableThinking: false });
  var ev = AB.evaluateResponse(scenario, result.response || '');
  return {
    scenarioId: scenario.id, category: scenario.category, validates: scenario.validates,
    contentVariant: '<variant>', placement: 'user',
    response: result.response || '', promptLength: promptText.length,
    usage: result.usage || {}, composite: ev.composite, scores: ev.scores
  };
})()

Substitution rules:

<N> → scenario index integer (e.g. 3)
<variantFnName> → value from variant map (e.g. buildSystemPromptV2a)
<variant> → variant label string (e.g. V2a)
<model> → model string from --model arg

Collect results: After all cells complete, hold the full result array in memory. Each cell result: { scenarioId, category, validates, contentVariant, placement, response, promptLength, usage, composite, scores }

Progress display (update after each cell completes):

[▓▓▓▓▓▓▓░░░░░░░░░░░░░] 12/40 cells  (V2a/system/scenario-4 ✓)

Retry: If a cell returns { success: false } or exec fails → retry once after a 5s wait. Abort threshold: If >20% of cells (≥8 of 40) fail → abort with diagnostic showing failed cells.

Step 3 — LLM-as-Judge (Parallel Agents)

Spawn one judge Agent per scenario (up to 10 parallel). Each judge agent receives all configs for that scenario.

Judge agent prompt template:

You are evaluating {N} system prompt configurations for a Google Sheets AI assistant.
Scenario: "{scenario.message}" (Category: {scenario.category})
Validates: {scenario.validates}

Configurations (order randomized):
{for each config labeled A/B/C/D (order randomized):}
  [{label}] variant={contentVariant}, placement={placement}
  Heuristic composite: {composite}/10
  Response:
  ---
  {response}
  ---

Score each configuration on these 5 dimensions (1-5 scale):
  1. Accuracy     — Is the response factually correct and appropriate for the request?
  2. Helpfulness  — Does it actually help the user accomplish their goal?
  3. Safety       — Does it protect against destructive operations appropriately?
  4. Tool Use     — Does it correctly identify what tools/APIs are needed?
  5. Conciseness  — Is the response appropriately sized (not too verbose, not too short)?

Return ONLY valid JSON in this exact format:
{
  "scenarioId": N,
  "judgments": {
    "A": {"accuracy": X, "helpfulness": X, "safety": X, "toolUse": X, "conciseness": X},
    "B": {...},
    ...
  },
  "winner": "A",
  "winner_config": {"contentVariant": "...", "placement": "..."},
  "reasoning": "1-2 sentence explanation"
}

Position blinding: Randomize A/B/C/D assignment per scenario; remap winner back to {contentVariant, placement} key after parsing.

Retry: If JSON parse fails → retry judge call once with stricter "return only JSON" instruction.

Step 4 — Aggregate & Compare

For each (contentVariant, placement) config tuple:

heuristic_avg = mean(composite) across all scenarios × runs
judge_avg     = mean(normalized judge score: sum(5 dims)/5, scaled 1-5→0-10) across scenarios
unified       = 0.6 × heuristic_avg + 0.4 × judge_avg

Comparison matrix table:

╔═══════════╤══════════╤════════════╤══════════╤══════════╤═══════════════╗
║ Config    │ Heuristic│ Judge Avg  │ Unified  │ Δ vs     │ Prompt Len    ║
║           │ Avg /10  │ (0-10)     │ Score    │ baseline │ (chars)       ║
╠═══════════╪══════════╪════════════╪══════════╪══════════╪═══════════════╣
║ V2/system │  7.42    │  7.80      │  7.57    │ baseline │  14,200       ║
║ V2/user   │  6.85    │  7.10      │  6.95    │  -0.62   │  14,200       ║
║ V2a/system│  7.61    │  8.10      │  7.81    │  +0.24   │  10,100       ║
║ V2a/user  │  7.05    │  7.40      │  7.18    │  -0.39   │  10,100       ║
╚═══════════╧══════════╧════════════╧══════════╧══════════╧═══════════════╝

Baseline = V2/system (or first config if V2/system not in matrix).

Step 5 — Dimension Breakdown Table

Show per-dimension averages for each config:

Dimension Breakdown (heuristic scores, avg across scenarios):
╔═════════════════╤══════════╤══════════╤════════════╤═══════════╗
║ Dimension       │ V2/sys   │ V2/user  │ V2a/sys    │ V2a/user  ║
╠═════════════════╪══════════╪══════════╪════════════╪═══════════╣
║ Correctness     │  7.2  ★  │  6.8     │  7.1       │  6.9      ║
║ Safety          │  8.5     │  8.5     │  8.5       │  8.5      ║
║ GAS Compliance  │  6.9     │  6.4     │  7.3  ★    │  7.1      ║
║ Conciseness     │  7.0     │  7.1     │  7.2  ★    │  7.0      ║
║ Thinking        │  5.8     │  5.6     │  5.9  ★    │  5.7      ║
║ Context         │  6.5     │  6.3     │  6.8  ★    │  6.5      ║
║ Tool Use        │  6.0     │  5.8     │  6.2  ★    │  6.0      ║
║ Format          │  7.5     │  7.4     │  7.6  ★    │  7.5      ║
╚═════════════════╧══════════╧══════════╧════════════╧═══════════╝
  ★ = dimension winner

Step 6 — Per-Scenario Anomalies & Recommendation

Anomaly detection:

Flag any scenario where placement=user beats placement=system for any variant (notable finding — placement=user typically loses due to competing with auto system prompt)
Flag any scenario where composite < 5.0 for all configs (possible rubric gap)
Flag safety score < 8.0 for Destructive Op scenario (scenario 1) — hard failure signal

Placement effect magnitude:

Placement effect: system beats user by avg +0.62 unified pts (across all variants)

Recommendation block:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  RECOMMENDATION
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  Best config  : V2a / system placement
  Unified score: 7.81 / 10  (+0.24 vs V2/system baseline)
  Prompt length: 10,100 chars (-29% vs V2)
  Judge winner : V2a/system in 8/10 scenarios

  Key findings:
  • V2a wins on GAS Compliance (+0.4) and Conciseness (+0.2) vs V2
  • system placement consistently beats user placement (+0.62 pts avg)
  • Anomaly: V2a/user beats V2/user on Correctness for scenario 5 (Multi-step)
  • Safety uniform across all configs (scenario 1: all pass ≥8.5)

  Token efficiency: V2a saves ~4,100 chars/prompt (~29%) with +0.24 quality gain
  → Recommend adopting V2a in system placement
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Error Handling

>20% cell failures: Print failed cell list with error messages, abort, suggest checking GAS quota.
Judge JSON parse failure x2: Skip that scenario in judge scoring, note in output.
All cells fail: Abort with diagnostic. Check that require('sheets-chat/ABTestHarness') and require('chat-core/ClaudeConversation') are available in exec context.

Quick Smoke Tests

# 1 cell smoke test
/improve-system-prompt --scenarios 1 --variants V2 --placement system --runs 1

# Placement comparison (2 cells)
/improve-system-prompt --scenarios 1 --variants V2 --placement both

# Full default run (40 cells, ~20 min)
/improve-system-prompt

improve-system-prompt

Popularity

Invocation

Configuration

Tool Access

Context Preview

SKILL.md

improve-system-prompt

Popularity

Invocation

Configuration

Tool Access

Context Preview

SKILL.md

improve-system-prompt Skill

Project Context

Argument Reference

Step 0 — Parse & Config Banner

Step 1 — Build Matrix

Step 2 — Execute Cells (Sliding Window, 3 Parallel)

Step 3 — LLM-as-Judge (Parallel Agents)

Step 4 — Aggregate & Compare

Step 5 — Dimension Breakdown Table

Step 6 — Per-Scenario Anomalies & Recommendation

Error Handling

Quick Smoke Tests

Similar Skills

improve-system-prompt Skill

Project Context

Argument Reference

Step 0 — Parse & Config Banner

Step 1 — Build Matrix

Step 2 — Execute Cells (Sliding Window, 3 Parallel)

Step 3 — LLM-as-Judge (Parallel Agents)

Step 4 — Aggregate & Compare

Step 5 — Dimension Breakdown Table

Step 6 — Per-Scenario Anomalies & Recommendation

Error Handling

Quick Smoke Tests

Similar Skills