From hillclimb
Interactive onboarding for a hill-climbing optimization project. Use this skill whenever the user wants to start a long, iterative effort to optimize a measurable score, minimize a loss, maximize accuracy, or beat a perf target. Trigger on phrases like "set up a hill-climbing project", "run the hillclimb-onboard skill", "start an iterative project", "let's iterate on X with a dashboard", or when the user describes a problem whose solution requires many rounds of try-verify-revise. This skill walks the user through one question at a time, fills the gaps with a research subagent, and produces .hillclimb/state.html (a self-rendering dashboard) plus a verify.sh script template.
Slash command: /hillclimb:hillclimb-onboard
You are setting up a long-running iterative project. The output is a
self-contained <cwd>/.hillclimb/ directory: state.py (helper CLI),
state.html (dashboard), run_with_timeout.py (per-command time wrapper),
verify.sh, .gitignore. After onboarding, the other skills call
<cwd>/.hillclimb/state.py directly: no skill discovery, no _lib/.
SKILL_DIR="${CLAUDE_PLUGIN_ROOT}/skills/hillclimb-onboard"
SRC_STATE_PY="$SKILL_DIR/state.py"; TPL_DIR="$SKILL_DIR/verify_templates"
STATE_HTML="$PWD/.hillclimb/state.html"; STATE_PY="$PWD/.hillclimb/state.py"
If $STATE_HTML already exists, ask whether to continue (patch missing
fields without re-asking the user about anything already set) or start
over (delete .hillclimb/).
Never silently overwrite. In particular, do not re-copy verify.sh if it
already exists in .hillclimb/; the user may have customised it. If the
user asks to change the template in continue mode, confirm with them
before replacing the existing file.
Use AskUserQuestion for every question, one per call, never bundled.
After each answer, persist immediately via state.py (Step 4 has the
init block; run that right after Question 1 so $STATE_PY exists for
all subsequent persists). The user's attention drifts on long forms.
Skip questions whose answer is unambiguous from prior context. Infer the value, persist it silently, and rely on Step 6's confirm-summary to let the user fix it if you guessed wrong. Track which fields you inferred (vs. asked) so Step 6 can flag them. When in doubt, ask: the cost of one extra question is small, the cost of a silently wrong inference the user doesn't notice is larger.
| # | Field | Notes |
|---|---|---|
| 1 | Project name | short title for the dashboard |
| 2 | Objective description | 1-2 sentences on what success looks like |
| 3 | Direction | minimize or maximize. Almost always inferable from the objective: loss / error / MSE / RMSE / latency / cost / memory → minimize; accuracy / F1 / throughput / win-rate / speed → maximize. Only ask when the metric is genuinely ambiguous (a bare "score" can go either way — fraud/anomaly/error scores minimize, evaluation scores maximize). |
| 4 | Target (optional) | numeric threshold the user wants to hit; the loop stops when best.score crosses it. Skip the ask if the objective already names a number (e.g., "get loss below 0.05" → target 0.05) |
| 5 | Unit (optional) | short label (MSE, TFLOPs/sec, accuracy); free text via "Other"; falls back to SCORE on the chart. Skip the ask if the objective spells out the metric (e.g., "minimize MSE" → unit MSE) |
| 6 | Verifier template | metric (compute one number, threshold-gate); custom (empty starter, you fill in the JSON contract). "Other" lets the user supply a verifier command directly without a template. |
| 7 | Time limit per long-running command | minutes; presets 5 / 15 / 60 plus Other (free text). Stored as seconds = round(minutes * 60). Caps every long bash invocation that execute/verify spawn, including the baseline run in Step 4b, so this must be asked before Q8. On timeout, the wrapper kills the process tree and verify records inconclusive. |
| 8 | Baseline | the score on the unmodified tree. You run the verifier yourself (see Step 4b) — do not ask the user to paste a number. Greenfield projects skip baseline. The chart's progress bar fills from baseline → target. |
| 9 | Stop criteria | e.g., "score ≤ 0.05", "all checks pass", "user satisfied". Free text for the dashboard; the loop's automatic target-met stop reads objective.target. Skip the ask if it mirrors the target (target 0.05 + minimize → "score ≤ 0.05") |
| 10 | Initial ideas | ≥ 3 distinct directions; push back if the user offers only 1-2 |
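
The inference rules in rows 3, 7, and 9 can be sketched in Python as below. This is only an illustration of the table, not code shipped with the skill: the function names are hypothetical, the keyword lists are copied from row 3, and the `≥` form for maximize targets is an assumption (the table only spells out the minimize case).

```python
import re

# Keyword lists copied from row 3 of the table above.
MINIMIZE_WORDS = {"loss", "error", "mse", "rmse", "latency", "cost", "memory"}
MAXIMIZE_WORDS = {"accuracy", "f1", "throughput", "win-rate", "speed"}


def infer_direction(objective):
    """Return "minimize", "maximize", or None when the question must be asked."""
    tokens = set(re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", objective.lower()))
    wants_min = bool(tokens & MINIMIZE_WORDS)
    wants_max = bool(tokens & MAXIMIZE_WORDS)
    if wants_min == wants_max:
        # Neither list matched (e.g. a bare "score") or both did: ambiguous.
        return None
    return "minimize" if wants_min else "maximize"


def minutes_to_seconds(minutes):
    """Row 7: the time limit is stored as round(minutes * 60)."""
    return round(minutes * 60)


def mirrored_stop_criteria(direction, target):
    """Row 9: stop criteria that simply mirror the target.

    The maximize branch (>=) is an assumption, not stated in the table.
    """
    op = "≤" if direction == "minimize" else "≥"
    return f"score {op} {target:g}"
```

Returning `None` for anything ambiguous keeps the "when in doubt, ask" rule explicit: the caller falls back to AskUserQuestion instead of guessing.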
Persist examples (use after Step 4's init; same set shape for the rest):
python3 "$STATE_PY" set "$STATE_HTML" project.objective.direction '"maximize"'
python3 "$STATE_PY" set "$STATE_HTML" project.objective.target 1.0
python3 "$STATE_PY" set "$STATE_HTML" project.baseline '{"score": 0.42, "notes": "linear regression baseline"}'
python3 "$STATE_PY" set "$STATE_HTML" project.time_limit_seconds 900
python3 "$STATE_PY" append-idea "$STATE_HTML" '{"title":"feature engineering","description":"add lag features","priority":"high"}' --added-by onboard
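
The nested quoting in the examples above (`'"maximize"'`, a bare `1.0`, a JSON object) suggests that `set` takes its value as a JSON document. Assuming that is the case, a small hypothetical helper can build the argument list with `json.dumps` so the quoting is never written by hand. `build_set_argv` is not part of state.py; it is a sketch for illustration.

```python
import json


def build_set_argv(state_py, state_html, path, value):
    """Build the argv for `state.py set`, JSON-encoding the value.

    Assumption: `set` parses its last argument as JSON, as the
    examples above imply.
    """
    return ["python3", state_py, "set", state_html, path, json.dumps(value)]


# Example: the same writes as the shell examples, without manual quoting.
argv = build_set_argv(".hillclimb/state.py", ".hillclimb/state.html",
                      "project.objective.direction", "maximize")
# argv[-1] is the string '"maximize"', ready for subprocess.run(argv, check=True)
```

Passing the list to `subprocess.run` without a shell also sidesteps shell-quoting problems when notes contain apostrophes.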
After you have the project name and objective description, launch a
general-purpose subagent that reads existing files in $PWD, compares
the user's objective against typical pitfalls (is the metric automatable?
is the baseline reproducible? are the initial ideas concrete?), and
returns ≤ 5 specific gaps with one suggested follow-up question each.
Cap response at ~200 words.
Then ask the user one gap question at a time via AskUserQuestion.
Persist each answer; stop when the user pushes back or all gaps addressed.
Create .hillclimb/ after Question 1:
mkdir -p "$PWD/.hillclimb"
python3 "$SRC_STATE_PY" init "$STATE_HTML" --name "<project name>"
cp "$SRC_STATE_PY" "$STATE_PY" && chmod +x "$STATE_PY"
cp "$SKILL_DIR/run_with_timeout.py" "$PWD/.hillclimb/run_with_timeout.py"
chmod +x "$PWD/.hillclimb/run_with_timeout.py"
After Question 6 (verifier template), wire it up. Pick metric.sh or
custom.sh from $TPL_DIR/ based on the user's answer.
Refuse to overwrite an existing verify.sh. If one is already there,
ask the user before replacing it.
TEMPLATE="metric.sh" # or custom.sh, per the user's answer
DEST="$PWD/.hillclimb/verify.sh"
if [ -e "$DEST" ]; then
echo "verify.sh already exists, leaving it in place"
else
cp "$TPL_DIR/$TEMPLATE" "$DEST" && chmod +x "$DEST"
fi
python3 "$STATE_PY" set "$STATE_HTML" project.verifier.command '"bash .hillclimb/verify.sh"'
If the user picked "Other" (no template, custom command), skip the cp and
set project.verifier.command to whatever shell command they supplied. The
command must emit a single JSON line on stdout: {"status":"pass"|"fail"|"inconclusive","score":<num>,"notes":"..."}.
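
For a user who supplies their own verifier, the JSON contract above might be produced by something like the following sketch. The helper names and the threshold-gating rule are assumptions for illustration; only the three status values and the `status` / `score` / `notes` keys come from the contract.

```python
import json
import math

VALID_STATUSES = {"pass", "fail", "inconclusive"}


def format_result(status, score, notes=""):
    """Return the single JSON line the verifier contract expects on stdout."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
    if not is_number or not math.isfinite(score):
        raise ValueError(f"score must be a finite number, got {score!r}")
    # json.dumps escapes newlines inside strings, so the output stays on one line.
    return json.dumps({"status": status, "score": score, "notes": notes})


def threshold_status(score, threshold, direction):
    """Hypothetical gate: pass when the score is on the right side of the threshold."""
    if direction == "minimize":
        return "pass" if score <= threshold else "fail"
    return "pass" if score >= threshold else "fail"


if __name__ == "__main__":
    score = 0.042  # placeholder; a real verifier would compute this
    print(format_result(threshold_status(score, 0.05, "minimize"), score, "placeholder run"))
```

Printing the result as the last line of stdout matters, because the baseline triage reads the last line of the verifier output.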
From here on use $STATE_PY (the project-local copy). That's what the
other skills will use, so the smoke-test path matches.
Run the verifier yourself and write the score to project.baseline.
The user does not paste a number — reproducibility comes from us running
the same command that hillclimb-verify will run on every iteration.
Continue mode: if project.baseline.score is already non-null when
Step 4b begins (user is re-onboarding to add ideas, not rebaseline),
skip this whole step. Re-running here would overwrite a real baseline
captured against a different code state and silently invalidate every
prior run's "improved?" call.
Otherwise, gate readiness via AskUserQuestion with three options: ready to run, edit first, or greenfield.
Show the gate for all template choices, including Other. A
user-supplied Other command can be just as unfinished as a template
stub (e.g., it shells out to a script that isn't written yet); always
let the user gate it.
For reference: the metric template ships with EVAL_CMD='python eval.py'
and the custom template emits valid JSON with score: 0.0 from a stub
— both would parse successfully and silently write a junk baseline if
you skipped the gate.
If the user picks greenfield:
python3 "$STATE_PY" set "$STATE_HTML" project.baseline '{"score": null, "notes": "greenfield, no baseline"}'
If the user picks edit first, wait, then re-show the gate.
If the user picks ready to run, invoke the verifier exactly the way
hillclimb-verify does (so the baseline is comparable to later runs).
Time limit comes from project.time_limit_seconds, which Question 7
already set:
STATE_JSON=$(python3 "$STATE_PY" read "$STATE_HTML")
CMD=$(printf '%s' "$STATE_JSON" | python3 -c 'import json,sys; print(json.load(sys.stdin)["project"]["verifier"]["command"])')
TIMEOUT=$(printf '%s' "$STATE_JSON" | python3 -c 'import json,sys; print(json.load(sys.stdin)["project"].get("time_limit_seconds") or 600)')
python3 .hillclimb/run_with_timeout.py "$TIMEOUT" bash -c "$CMD" 2>&1
Run this as a single Bash tool call in the foreground
(run_in_background: false). Pass the Bash tool's timeout ≈
TIMEOUT*1000 + 10000 ms. The verifier's combined stdout/stderr lands
in the tool's output (call it RAW); the tool's reported exit code is
the wrapper's exit (call it EXIT_CODE). Don't RAW=$(...) it inside
the shell — the variable would be lost before you could read it.
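
To make the timeout behaviour concrete, here is a minimal sketch of what a wrapper like run_with_timeout.py could do, plus the Bash tool timeout arithmetic from the paragraph above. It is not the shipped script: it assumes a POSIX system, uses a process group to approximate "kill the process tree", and takes exit code 124 from the triage table.

```python
import os
import signal
import subprocess
import sys

TIMEOUT_EXIT_CODE = 124  # the code the triage step treats as a timeout


def run_with_timeout(argv, limit_seconds):
    """Run argv; on timeout, kill its whole process group and return 124.

    POSIX-only sketch: start_new_session puts the child in its own
    process group, so killpg also reaches any grandchildren it spawned.
    """
    proc = subprocess.Popen(argv, start_new_session=True)
    try:
        return proc.wait(timeout=limit_seconds)
    except subprocess.TimeoutExpired:
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the process exited between the timeout and the kill
        proc.wait()  # reap the child so no zombie is left behind
        return TIMEOUT_EXIT_CODE


def bash_tool_timeout_ms(limit_seconds, grace_ms=10_000):
    """Bash tool timeout: TIMEOUT * 1000 plus a 10 s grace period, in ms."""
    return int(limit_seconds * 1000) + grace_ms
```

The grace period gives the wrapper time to kill the process group and report 124 before the outer tool call gives up.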
Triage EXIT_CODE (mirrors hillclimb-verify Step 3). On any failure
path, the agent picks one of three concrete writes and runs the matching
state.py set — never the literal string "none":
| EXIT_CODE | Action |
|---|---|
| 0, parsed last-line JSON has a numeric score | set project.baseline '{"score": <score>, "notes": "onboarding baseline run"}' (success path). |
| 0, no parseable JSON or no score field | Show the raw tail. AskUserQuestion: retry / enter manual score / skip baseline. |
| 124 (timeout) | Show the timeout message. AskUserQuestion: retry with a larger time_limit_seconds / enter manual score / skip baseline. Do not auto-write anything. |
| other nonzero | Show the raw tail. AskUserQuestion: retry after fixing verify.sh / enter manual score / skip baseline. |
The three follow-up choices map to concrete writes:
- retry: re-run the verifier after the fix (an edited verify.sh or a larger time_limit_seconds). Loop until success or the user picks another option.
- enter manual score: set project.baseline '{"score": <num>, "notes": "manual entry; verifier did not produce a score"}'.
- skip baseline: set project.baseline '{"score": null, "notes": "baseline skipped; verifier failed at onboarding"}'.
Never silently coerce a failed verifier into score: 0; a wrong
baseline poisons every future "improved?" check.
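
The triage table can be read as a small decision function. The sketch below is one way to express it; the function names and the `("ask_user", reason)` return shape are hypothetical, while the exit codes, the last-line JSON rule, and the baseline notes string come from the table.

```python
import json


def parse_last_line_score(raw):
    """Return the numeric score from the last non-empty output line, else None."""
    lines = [line for line in raw.splitlines() if line.strip()]
    if not lines:
        return None
    try:
        payload = json.loads(lines[-1])
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    score = payload.get("score")
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return None
    return score


def triage(exit_code, raw):
    """Map (EXIT_CODE, RAW) to an action, mirroring the table above."""
    if exit_code == 0:
        score = parse_last_line_score(raw)
        if score is not None:
            baseline = {"score": score, "notes": "onboarding baseline run"}
            return ("write_baseline", baseline)
        return ("ask_user", "no parseable score")
    if exit_code == 124:
        return ("ask_user", "timeout")
    return ("ask_user", "verifier failed")
```

Note that no branch ever invents a score: every failure path hands the decision back to the user, which is the point of the "never coerce to 0" rule.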
Write .hillclimb/.gitignore.
Ignore the dashboard plus scratch outputs. state.html is gitignored so
git checkout/reset can never destroy run history; the dashboard lives
only on disk.
cat > "$PWD/.hillclimb/.gitignore" <<'EOF'
state.html
*.tmp
*.log
__pycache__/
EOF
python3 "$STATE_PY" read "$STATE_HTML"
Render a compact bullet summary: project name, objective (description,
direction, target, unit), baseline, stop criteria, verifier command, time
limit (shown in minutes, i.e. seconds / 60), ideas (id, title, priority).
Mark any field you inferred rather than asked with a trailing
(inferred) tag in the summary, so the user knows where to look hard.
A silently-wrong guess is the failure mode this step exists to catch.
Then one AskUserQuestion: confirm or revise a specific field.
On revise, update via state.py set (or append-idea) and loop. Don't
proceed until the user explicitly picks confirm.
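
The confirm-summary rendering can be sketched as follows. The helper names and the flat `(key, label, value)` input are assumptions chosen to avoid guessing the real layout of the state file; only the `(inferred)` tag and the seconds-to-minutes conversion come from the text above.

```python
def format_time_limit(seconds):
    """Show the stored seconds as minutes (seconds / 60)."""
    return f"{seconds / 60:g} min"


def render_summary(fields, inferred):
    """Render one bullet per field, tagging the ones that were inferred.

    fields:   list of (key, label, value) tuples, in display order
    inferred: set of keys that were inferred rather than asked
    """
    lines = []
    for key, label, value in fields:
        tag = " (inferred)" if key in inferred else ""
        lines.append(f"- {label}: {value}{tag}")
    return "\n".join(lines)


# Example with hypothetical values:
summary = render_summary(
    [
        ("direction", "Direction", "minimize"),
        ("target", "Target", 0.05),
        ("time_limit", "Time limit", format_time_limit(900)),
    ],
    inferred={"direction"},
)
```

Keeping the set of inferred keys separate from the values means the tag cannot be lost when a value is later revised and re-rendered.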
- Dashboard: file://<cwd>/.hillclimb/state.html
- The verifier command is what hillclimb-verify runs; changing it after a real baseline makes scores incomparable.
- .hillclimb/state.py and .hillclimb/run_with_timeout.py are project-local helpers; don't move or delete them.
- Use the hillclimb-execute skill to try the first idea.
- Don't edit state.html by hand. Always go through state.py.
- Don't start implementing ideas; that's hillclimb-execute. Don't re-run the verifier to compare; that's hillclimb-verify.
npx claudepluginhub ntt123/hill-climbing-skills --plugin hillclimb