exploration-optimizer | exploration-cycle-plugin

User wants to evaluate and improve a specific exploration skill.
User: Evaluate and improve the exploration-session-brief skill using the optimization loop.
Agent: [invokes exploration-optimizer, runs baseline-first iteration on exploration-session-brief]

User notices a skill feels weak and wants a systematic improvement cycle.
User: The exploration cycle feels slow. Help me identify which skill to optimize first.
Agent: [invokes exploration-optimizer, runs discovery phase to identify highest-impact target]

BRD generation routes to business-requirements-capture, not this skill.
User: Generate a BRD from our session captures.
Agent: [invokes business-requirements-capture, NOT exploration-optimizer]

Exploration Optimizer

See acceptance criteria

Discovery Phase

Ask for:

The target exploration skill or agent to optimize.
The eval set to use, or whether to generate one from the current architecture.
The iteration budget.
Whether auto-apply of winning variants is allowed.
Which metrics matter most for this loop: routing quality, artifact usefulness, handoff stability, re-entry quality, or human intervention burden.
Whether post-run survey data exists and should be included in the decision.

Recap

Confirm:

target component
eval source
loop budget
chosen scoring dimensions
whether survey data is available
whether auto-apply is enabled
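
One way to hold the confirmed answers is a small immutable record that can be echoed back to the user before the loop starts. This is only a sketch: the class name LoopSettings, its field names, and the recap() wording are hypothetical and not part of the plugin.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LoopSettings:
    """Hypothetical container for the six items confirmed in the recap."""

    target_component: str
    eval_source: str
    loop_budget: int
    scoring_dimensions: Tuple[str, ...]
    survey_data_available: bool = False
    auto_apply_enabled: bool = False

    def recap(self) -> str:
        """Render the settings as one line per item, in recap order."""

        def yes_no(flag: bool) -> str:
            return "yes" if flag else "no"

        return "\n".join(
            [
                f"target component: {self.target_component}",
                f"eval source: {self.eval_source}",
                f"loop budget: {self.loop_budget}",
                f"chosen scoring dimensions: {', '.join(self.scoring_dimensions)}",
                f"survey data available: {yes_no(self.survey_data_available)}",
                f"auto-apply enabled: {yes_no(self.auto_apply_enabled)}",
            ]
        )
```

Freezing the record means the settings the user confirmed cannot drift silently once iterations begin.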

Execution

This skill implements autoresearch-style optimization for the exploration-cycle system. It uses a baseline-first iteration loop to improve skill prompts and logic.

Usage:

python3 ./scripts/execute.py \
  --target ${plugins}/skills/user-story-capture/SKILL.md \
  --eval-script ./scripts/eval_runner.py \
  --goal "Improve Gherkin block accuracy" \
  --iterations 3

When optimizing the Spec-Kitty agent or workflow files themselves, follow the target-specific playbook in references/spec-kitty-skill-optimizer-program.md.

Iteration Loop

The execute.py script records a baseline first, then follows a disciplined loop:

Change one dominant variable per iteration.
Re-run evaluations.
Mark the attempt as keep or discard.
If the run crashes or times out, log the failure and continue from the last known good state.
Never let a subjective preference override a clear regression in the tracked metrics.
Use survey feedback as a quality signal, not an excuse to ignore the baseline-first method.
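
The loop rules above can be sketched as a keep-or-discard routine. This is an illustration under stated assumptions, not the actual contents of execute.py: the names run_loop, Attempt, propose, and evaluate are hypothetical, and the sketch assumes a single numeric score where higher is better.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Attempt:
    iteration: int
    score: Optional[float]  # None when the run crashed or timed out
    decision: str           # "keep", "discard", or "failed"


def run_loop(
    baseline: str,
    propose: Callable[[str, int], str],
    evaluate: Callable[[str], float],
    iterations: int,
) -> Tuple[str, float, List[Attempt]]:
    """Hypothetical baseline-first loop; higher scores are assumed better."""
    # Score the unmodified target first so every variant has a reference.
    best, best_score = baseline, evaluate(baseline)
    log: List[Attempt] = []
    for i in range(1, iterations + 1):
        # The proposer is expected to change one dominant variable.
        candidate = propose(best, i)
        try:
            score = evaluate(candidate)  # re-run evaluations
        except Exception:
            # Crash or timeout: log it and continue from the last good state.
            log.append(Attempt(i, None, "failed"))
            continue
        if score > best_score:
            best, best_score = candidate, score
            log.append(Attempt(i, score, "keep"))
        else:
            # A tie or regression never replaces the current best.
            log.append(Attempt(i, score, "discard"))
    return best, best_score, log
```

Because a candidate is kept only when its measured score beats the current best, a subjective preference cannot override a regression in this sketch.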

Suggested Metrics

routing quality
artifact usefulness
handoff stability
re-entry quality
human intervention burden
unnecessary agent invocation rate
post-run survey composite score
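
If several of these metrics are tracked at once, they may need to be folded into a single number for the keep-or-discard decision. The sketch below shows one possible approach, a weighted mean. It assumes every metric is already normalized to the range 0 to 1, and it assumes that human intervention burden and unnecessary agent invocation rate are lower-is-better. The function name composite_score and the weights are hypothetical; the source does not say how metrics are combined.

```python
from typing import Mapping

# Assumption: for these metrics a lower raw value is better, so they are inverted.
LOWER_IS_BETTER = {
    "human intervention burden",
    "unnecessary agent invocation rate",
}


def composite_score(values: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted mean of metrics in [0, 1]; lower-is-better metrics are inverted."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    score = 0.0
    for name, weight in weights.items():
        value = values[name]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be normalized to [0, 1]")
        if name in LOWER_IS_BETTER:
            value = 1.0 - value
        score += weight * value
    return score / total_weight
```

Only metrics named in the weights mapping contribute, so the chosen scoring dimensions from the recap can be expressed by which weights are supplied.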

Output

Always conclude execution with a Source Transparency Declaration that explicitly lists what was queried, so the user can verify where the results came from:

Sources Checked: [list]
Sources Unavailable: [list]
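
The declaration could be produced by a small helper so that it is never omitted or formatted inconsistently. This is a minimal sketch: the function name source_transparency_declaration is hypothetical, and rendering each list as comma-separated text (with "none" for an empty list) is an assumption about what the [list] placeholders stand for.

```python
from typing import Iterable


def source_transparency_declaration(
    checked: Iterable[str], unavailable: Iterable[str]
) -> str:
    """Render the two declaration lines; an empty list is shown as 'none'."""

    def render(items: Iterable[str]) -> str:
        materialized = list(items)
        return ", ".join(materialized) if materialized else "none"

    return (
        f"Sources Checked: {render(checked)}\n"
        f"Sources Unavailable: {render(unavailable)}"
    )
```

Printing "none" explicitly, rather than leaving a blank, makes it clear that the list was considered and found empty.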

Next Actions

Use ./scripts/benchmarking/run_loop.py --results-dir evals/experiments for repeatable improvement loops.
Suggest the user run audit-plugin to verify the generated artifacts.