Skill

hillclimb-onboard

Interactive onboarding for a hill-climbing optimization project. Use this skill whenever the user wants to start a long, iterative effort to optimize a measurable score, minimize a loss, maximize accuracy, or beat a perf target. Trigger on phrases like "set up a hill-climbing project", "run the hillclimb-onboard skill", "start an iterative project", "let's iterate on X with a dashboard", or when the user describes a problem whose solution requires many rounds of try-verify-revise. This skill walks the user through one question at a time, fills the gaps with a research subagent, and produces .hillclimb/state.html (a self-rendering dashboard) plus a verify.sh script template.

Popularity

Stars

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/hillclimb:hillclimb-onboard

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

You are setting up a long-running iterative project. The output is a

Supporting Files

run_with_timeout.pystate.pytemplate.htmlverify_templates/custom.shverify_templates/metric.sh

SKILL.md

262 lines · ~3.3k tokens

Stats

LanguageHTML

Stars1

MaintenanceExcellent

Last CommitMay 27, 2026

Actions

View Source View Plugin View on GitHub View README

Stats

Actions

hillclimb-onboard: set up a hill-climbing project

You are setting up a long-running iterative project. The output is a self-contained <cwd>/.hillclimb/ directory: state.py (helper CLI), state.html (dashboard), run_with_timeout.py (per-command time wrapper), verify.sh, .gitignore. After onboarding the other skills call <cwd>/.hillclimb/state.py directly, no skill discovery, no _lib/.

Step 1: Locate bundled assets, handle existing state

SKILL_DIR="${CLAUDE_PLUGIN_ROOT}/skills/hillclimb-onboard"
SRC_STATE_PY="$SKILL_DIR/state.py"; TPL_DIR="$SKILL_DIR/verify_templates"
STATE_HTML="$PWD/.hillclimb/state.html"; STATE_PY="$PWD/.hillclimb/state.py"

If $STATE_HTML already exists, ask continue (patch missing fields without re-asking the user about anything already set) or start over (delete .hillclimb/).

Never silently overwrite. In particular: do not re-copy verify.sh if it already exists in .hillclimb/, the user may have customised it. If the user asks to change the template in continue mode, ask them first.

Step 2: Interview the user, ONE question at a time

Use AskUserQuestion for every question, one per call, never bundled. After each answer, persist immediately via state.py (Step 4 has the init block; run that right after Question 1 so $STATE_PY exists for all subsequent persists). The user's attention drifts on long forms.

Skip questions whose answer is unambiguous from prior context. Infer the value, persist it silently, and rely on Step 6's confirm-summary to let the user fix it if you guessed wrong. Track which fields you inferred (vs. asked) so Step 6 can flag them. When in doubt, ask: the cost of one extra question is small, the cost of a silently wrong inference the user doesn't notice is larger.

#	Field	Notes
1	Project name	short title for the dashboard
2	Objective description	1-2 sentences on what success looks like
3	Direction	`minimize` or `maximize`. Almost always inferable from the objective: loss / error / MSE / RMSE / latency / cost / memory → `minimize`; accuracy / F1 / throughput / win-rate / speed → `maximize`. Only ask when the metric is genuinely ambiguous (a bare "score" can go either way — fraud/anomaly/error scores minimize, evaluation scores maximize).
4	Target (optional)	numeric threshold the user wants to hit; the loop stops when `best.score` crosses it. Skip the ask if the objective already names a number (e.g., "get loss below 0.05" → target `0.05`)
5	Unit (optional)	short label (`MSE`, `TFLOPs/sec`, `accuracy`); free text via "Other"; falls back to `SCORE` on the chart. Skip the ask if the objective spells out the metric (e.g., "minimize MSE" → unit `MSE`)
6	Verifier template	`metric` (compute one number, threshold-gate); `custom` (empty starter, you fill in the JSON contract). "Other" lets the user supply a verifier command directly without a template.
7	Time limit per long-running command	minutes; presets `5 / 15 / 60` plus `Other` (free text). Stored as `seconds = round(minutes * 60)`. Caps every long bash invocation execute/verify spawn — including the baseline run in Step 4b, so this must be asked before Q8. On timeout, the wrapper kills the process tree and verify records `inconclusive`
8	Baseline	the score on the unmodified tree. You run the verifier yourself (see Step 4b) — do not ask the user to paste a number. Greenfield projects skip baseline. The chart's progress bar fills from baseline → target.
9	Stop criteria	e.g., "score ≤ 0.05", "all checks pass", "user satisfied". Free text for the dashboard; the loop's automatic target-met stop reads `objective.target`. Skip the ask if it mirrors the target (target `0.05` + minimize → "score ≤ 0.05")
10	Initial ideas	≥ 3 distinct directions; push back on 1-2

Persist examples (use after Step 4's init; same set shape for the rest):

python3 "$STATE_PY" set "$STATE_HTML" project.objective.direction '"maximize"'
python3 "$STATE_PY" set "$STATE_HTML" project.objective.target 1.0
python3 "$STATE_PY" set "$STATE_HTML" project.baseline '{"score": 0.42, "notes": "linear regression baseline"}'
python3 "$STATE_PY" set "$STATE_HTML" project.time_limit_seconds 900
python3 "$STATE_PY" append-idea "$STATE_HTML" '{"title":"feature engineering","description":"add lag features","priority":"high"}' --added-by onboard

Step 3: Spawn a gap-detection subagent (in parallel with Step 2)

After you have the project name and objective description, launch a general-purpose subagent that reads existing files in $PWD, compares the user's objective against typical pitfalls (is the metric automatable? is the baseline reproducible? are the initial ideas concrete?), and returns ≤ 5 specific gaps with one suggested follow-up question each. Cap response at ~200 words.

Then ask the user one gap question at a time via AskUserQuestion. Persist each answer; stop when the user pushes back or all gaps addressed.

Step 4: Scaffold `.hillclimb/`

After Question 1:

mkdir -p "$PWD/.hillclimb"
python3 "$SRC_STATE_PY" init "$STATE_HTML" --name "<project name>"
cp "$SRC_STATE_PY" "$STATE_PY" && chmod +x "$STATE_PY"
cp "$SKILL_DIR/run_with_timeout.py" "$PWD/.hillclimb/run_with_timeout.py"
chmod +x "$PWD/.hillclimb/run_with_timeout.py"

After Question 6 (verifier template), wire it up. Pick metric.sh or custom.sh from $TPL_DIR/ based on the user's answer. Refuse to overwrite an existing verify.sh, if one is already there, ask the user before replacing.

TEMPLATE="metric.sh"   # or custom.sh, per the user's answer
DEST="$PWD/.hillclimb/verify.sh"
if [ -e "$DEST" ]; then
  echo "verify.sh already exists, leaving it in place"
else
  cp "$TPL_DIR/$TEMPLATE" "$DEST" && chmod +x "$DEST"
fi
python3 "$STATE_PY" set "$STATE_HTML" project.verifier.command '"bash .hillclimb/verify.sh"'

If the user picked "Other" (no template, custom command), skip the cp and set project.verifier.command to whatever shell command they supplied. The command must emit a single JSON line on stdout: {"status":"pass"|"fail"|"inconclusive","score":<num>,"notes":"..."}.

From here on use $STATE_PY (the project-local copy). That's what the other skills will use, so the smoke-test path matches.

Step 4b: Run the baseline (Question 8)

Run the verifier yourself and write the score to project.baseline. The user does not paste a number — reproducibility comes from us running the same command that hillclimb-verify will run on every iteration.

Continue mode: if project.baseline.score is already non-null when Step 4b begins (user is re-onboarding to add ideas, not rebaseline), skip this whole step. Re-running here would overwrite a real baseline captured against a different code state and silently invalidate every prior run's "improved?" call.

Otherwise, gate readiness via AskUserQuestion with three options:

"ready to run" — verify.sh is wired up and produces a real number.
"let me edit verify.sh first" — placeholder template, or user wants to tweak. Wait for the user to confirm they're done, then loop back and re-ask this same gate.
"greenfield (no baseline)" — nothing meaningful to measure yet.

Show the gate for all template choices, including Other. A user-supplied Other command can be just as unfinished as a template stub (e.g., it shells out to a script that isn't written yet); always let the user gate it.

For reference: the metric template ships with EVAL_CMD='python eval.py' and the custom template emits valid JSON with score: 0.0 from a stub — both would parse successfully and silently write a junk baseline if you skipped the gate.

If the user picks greenfield:

python3 "$STATE_PY" set "$STATE_HTML" project.baseline '{"score": null, "notes": "greenfield, no baseline"}'

If the user picks edit first, wait, then re-show the gate.

If the user picks ready to run, invoke the verifier exactly the way hillclimb-verify does (so the baseline is comparable to later runs). Time limit comes from project.time_limit_seconds, which Question 7 already set:

STATE_JSON=$(python3 "$STATE_PY" read "$STATE_HTML")
CMD=$(printf '%s' "$STATE_JSON" | python3 -c 'import json,sys; print(json.load(sys.stdin)["project"]["verifier"]["command"])')
TIMEOUT=$(printf '%s' "$STATE_JSON" | python3 -c 'import json,sys; print(json.load(sys.stdin)["project"].get("time_limit_seconds") or 600)')
python3 .hillclimb/run_with_timeout.py "$TIMEOUT" bash -c "$CMD" 2>&1

Run this as a single Bash tool call in the foreground (run_in_background: false). Pass the Bash tool's timeout ≈ TIMEOUT*1000 + 10000 ms. The verifier's combined stdout/stderr lands in the tool's output (call it RAW); the tool's reported exit code is the wrapper's exit (call it EXIT_CODE). Don't RAW=$(...) it inside the shell — the variable would be lost before you could read it.

Triage EXIT_CODE (mirrors hillclimb-verify Step 3). On any failure path, the agent picks one of three concrete writes and runs the matching state.py set — never the literal string "none":

`EXIT_CODE`	Action
`0`, parsed last-line JSON has a numeric `score`	`set project.baseline '{"score": <score>, "notes": "onboarding baseline run"}'` — success path.
`0`, no parseable JSON or no `score` field	Show the raw tail. `AskUserQuestion`: retry / enter manual score / skip baseline.
`124` (timeout)	Show the timeout message. `AskUserQuestion`: retry with a larger time_limit_seconds / enter manual score / skip baseline. Do not auto-write anything.
other nonzero	Show the raw tail. `AskUserQuestion`: retry after fixing verify.sh / enter manual score / skip baseline.

The three follow-up choices map to concrete writes:

retry: re-run the verifier block above (after the user finishes any fix or raises time_limit_seconds). Loop until success or the user picks another option.
enter manual score: set project.baseline '{"score": <num>, "notes": "manual entry; verifier did not produce a score"}'.
skip baseline: same write as greenfield, but with notes "baseline skipped; verifier failed at onboarding".

Never silently coerce a failed verifier into score: 0 — a wrong baseline poisons every future "improved?" check.

Step 5: Write `.hillclimb/.gitignore`

Ignore the dashboard plus scratch outputs. state.html is gitignored so git checkout/reset can never destroy run history; the dashboard lives only on disk.

cat > "$PWD/.hillclimb/.gitignore" <<'EOF'
state.html
*.tmp
*.log
__pycache__/
EOF

Step 6: Confirm summary

python3 "$STATE_PY" read "$STATE_HTML"

Render a compact bullet summary: project name, objective (description, direction, target, unit), baseline, stop criteria, verifier command, time limit (seconds / 60 minutes), ideas (id, title, priority).

Mark any field you inferred rather than asked with a trailing (inferred) tag in the summary, so the user knows where to look hard. A silently-wrong guess is the failure mode this step exists to catch.

Then one AskUserQuestion: confirm or revise a specific field. On revise, update via state.py set (or append-idea) and loop. Don't proceed until the user explicitly picks confirm.

Step 7: Hand off (under five lines)

Dashboard path: file://<cwd>/.hillclimb/state.html
Baseline result: the score we captured, or "skipped (greenfield)" / "skipped (manual)". If verify.sh wasn't run successfully, nudge the user to fix it before the first hillclimb-verify — changing it after a real baseline makes scores incomparable.
Note that .hillclimb/state.py and .hillclimb/run_with_timeout.py are project-local helpers, don't move or delete.
Next: hillclimb-execute skill to try the first idea.

Rules

One question at a time. Never bundle.
Persist immediately after each answer. A user who abandons mid-flow should leave a coherent partial setup.
Push back on vague ideas. "Try better hyperparameters" is not an idea; "Sweep learning rate over [1e-4, 1e-2] with cosine schedule" is.
Never edit state.html by hand. Always go through state.py.
Run the verifier exactly once here — for the baseline (Step 4b). Don't execute ideas; that's hillclimb-execute. Don't re-run the verifier to compare; that's hillclimb-verify.

hillclimb-onboard

Popularity

Invocation

Context Preview

Supporting Files

SKILL.md

hillclimb-onboard

Popularity

Invocation

Context Preview

Supporting Files

SKILL.md

hillclimb-onboard: set up a hill-climbing project

Step 1: Locate bundled assets, handle existing state

Step 2: Interview the user, ONE question at a time

Step 3: Spawn a gap-detection subagent (in parallel with Step 2)

Step 4: Scaffold `.hillclimb/`

Step 4b: Run the baseline (Question 8)

Step 5: Write `.hillclimb/.gitignore`

Step 6: Confirm summary

Step 7: Hand off (under five lines)

Rules

Similar Skills

hillclimb-onboard: set up a hill-climbing project

Step 1: Locate bundled assets, handle existing state

Step 2: Interview the user, ONE question at a time

Step 3: Spawn a gap-detection subagent (in parallel with Step 2)

Step 4: Scaffold `.hillclimb/`

Step 4b: Run the baseline (Question 8)

Step 5: Write `.hillclimb/.gitignore`

Step 6: Confirm summary

Step 7: Hand off (under five lines)

Rules

Similar Skills

hillclimb-onboard

Popularity

Invocation

Context Preview

Supporting Files

SKILL.md

hillclimb-onboard

Popularity

Invocation

Context Preview

Supporting Files

SKILL.md

hillclimb-onboard: set up a hill-climbing project

Step 1: Locate bundled assets, handle existing state

Step 2: Interview the user, ONE question at a time

Step 3: Spawn a gap-detection subagent (in parallel with Step 2)

Step 4: Scaffold .hillclimb/

Step 4b: Run the baseline (Question 8)

Step 5: Write .hillclimb/.gitignore

Step 6: Confirm summary

Step 7: Hand off (under five lines)

Rules

Similar Skills

hillclimb-onboard: set up a hill-climbing project

Step 1: Locate bundled assets, handle existing state

Step 2: Interview the user, ONE question at a time

Step 3: Spawn a gap-detection subagent (in parallel with Step 2)

Step 4: Scaffold .hillclimb/

Step 4b: Run the baseline (Question 8)

Step 5: Write .hillclimb/.gitignore

Step 6: Confirm summary

Step 7: Hand off (under five lines)

Rules

Similar Skills

Step 4: Scaffold `.hillclimb/`

Step 5: Write `.hillclimb/.gitignore`

Step 4: Scaffold `.hillclimb/`

Step 5: Write `.hillclimb/.gitignore`