From probabl-skills
Drives the ML experiment iteration loop: propose, design, approve, implement, and record outcomes via journal/JOURNAL.md and per-experiment design notes.
Slash command: `/probabl-skills:iterate-ml-experiment`
The loop on top of `experiments/`: what to try next, why, what
counts as a result, how the trail is recorded. Pipeline / evaluation
mechanics live in sibling skills.
session open
│
├── JOURNAL.md missing / placeholder ──► § 0 Bootstrap
│ │
│ ├─► G-EDA (explore-ml-data: run | skip)
│ │
│ └─► design note → G-DESIGN → § 3 implement
│
├── "what's next?" with ≥1 done row ───► § 1 → § 2 (sourcing) → § 3 implement
│
├── "run finished" ─────────────────────► § 4 record outcome
│ │
│ └─► dispatch audit-ml-pipeline
│
└── "status?" / "compare X Y" ──────────► references/maintenance_modes.md
Always re-emit the Pre-flight checklist with evidence before declaring the turn done.
Open each sibling SKILL.md just-in-time when a step calls for
it (e.g. open evaluate-ml-pipeline before § 3's CV-strategy
step). Do not pre-read all at session start.
Sibling skills (just-in-time):
- organize-ml-workspace, data-science-python-stack,
python-env-manager, python-api, python-code-style,
explore-ml-data, build-ml-pipeline, evaluate-ml-pipeline,
test-ml-pipeline, smoke-test-ml-pipeline,
iterate-from-skore / iterate-from-user
Then before answering:
- Read journal/JOURNAL.md. Missing/placeholder → bootstrap (§ 0).
  This is the canonical project digest (Status, Data understanding
  (EDA), History, Backlog).
- Check the Workspace decisions block for pre-recorded gates
  (tabular, env_manager, package, skore_mode, cv_splitter): a
  recorded decision skips its AskUserQuestion.

You read one mode section per turn. Match the user's signal, then jump.
| Signal / workspace state | Mode | Section |
|---|---|---|
| JOURNAL.md missing / placeholder / 0 History rows | Bootstrap | § 0 |
| journal/ not scaffolded (no src/, no experiments/) | Bootstrap → handoff first | → organize-ml-workspace, then § 0 |
| "what's next?" / "let's iterate" / "propose next" — with ≥1 done row | Iterate (propose) | §§ 1–3 + Dispatch table |
| "the run finished" / "log the result" / "we got X = …" | Iterate (record) | § 4 |
| "where are we?" / "status?" / "what have we tried?" | Project overview | references/maintenance_modes.md § "Project overview" |
| "compare X and Y" / "X vs Y" | Compare (read-only) | references/maintenance_modes.md § "Compare past experiments" |
| "let's pivot the goal" / "actually we care about " | Goal pivot | references/maintenance_modes.md § "Goal pivots" |
| "abandon X" / "drop X" | Abandoned | references/maintenance_modes.md § "Abandoned experiments" |
| Re-do a prior experiment under different conditions | Re-run | references/maintenance_modes.md § Re-runs |
If two modes seem to match ("compare X and Y, then propose"), pick the read mode first, stop. Re-entering § 1 is a separate turn.
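
The picker can be sketched as a small function. This is only an illustration under stated assumptions: `pick_mode`, the phrase lists, substring matching, and the `"ask"` fallback are hypothetical and not part of the skill; the re-run row is left out for brevity.

```python
# Hypothetical keyword rules mirroring the Mode picker table. The phrases come
# from the table; matching by substring is an assumption of this sketch.
READ_MODES = [
    ("compare", ("compare ", " vs ")),
    ("overview", ("where are we", "status?", "what have we tried")),
]
WRITE_MODES = [
    ("iterate-record", ("the run finished", "log the result", "we got ")),
    ("goal-pivot", ("pivot the goal", "actually we care about")),
    ("abandoned", ("abandon ", "drop ")),
    ("iterate-propose", ("what's next", "let's iterate", "propose next")),
]


def pick_mode(signal: str, history_rows: int) -> str:
    """Return exactly one mode for the turn."""
    # Workspace state wins: no History rows means bootstrap.
    if history_rows == 0:
        return "bootstrap"
    text = signal.lower()
    # Read modes are checked first, so "compare X and Y, then propose"
    # resolves to compare and the turn stops there.
    for mode, phrases in READ_MODES + WRITE_MODES:
        if any(phrase in text for phrase in phrases):
            return mode
    return "ask"  # ambiguous: fall back to a structured question


print(pick_mode("compare X and Y, then propose", history_rows=3))
```

Listing the read modes ahead of the write modes is what encodes the "pick the read mode first, stop" rule.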
No design note, no script. Never create or edit
experiments/NN_*.py until journal/NN_*.md exists, is filled,
and the user has explicitly approved it.
JOURNAL.md is read at session start, not improvised. Don't
reconstruct history from experiments/ filenames or git log —
those don't carry the why.
Strategy is picked, not assumed. Name the sourcing strategy
in every proposal (skore / user / my-pick / B<N>). Don't
silently default. Exception: bootstrap — baseline is forced
by workspace defaults; no strategy dispatch.
Approval is explicit. "approved" / "yes" / "go" / "looks
good" from the user is the gate. Ambiguous → re-ask via
AskUserQuestion.
Outcomes are recorded, not narrated. When the run finishes,
the outcome lands in JOURNAL.md AND the Status block before
the conversation moves on.
Prior experiments stay reproducible. Every done row must
remain runnable on main with the same result. When touching
src/<pkg>/, default behavior preserves prior experiments' shape
(see build-ml-pipeline § Reproducibility). Cheap check:
tests/smoke/ — any prior smoke test going red means default
behavior is broken.
Three skills, in order, before any code in src/<pkg>/.
After G-DESIGN:
- build-ml-pipeline → pipeline.py / features.py / data.py.
- evaluate-ml-pipeline → evaluate.py. Owns CV-strategy via
  AskUserQuestion. Writing evaluate.py without invoking it
  is the most common shortcut.
- test-ml-pipeline → smoke-test-ml-pipeline → smoke test.

Only then assemble experiments/NN_*.py.
Harness "no clarifying questions" hints do NOT waive gates. G-DESIGN, G-RUN, the §1 mode pick, the §2 sourcing menu, the §0 config gates are operating-contract gates.
Post-hoc audit — required before ending the turn. Walk every pre-flight row; surface unfilled Evidence cells explicitly.
| Shortcut | Why it's wrong |
|---|---|
| User said "quick baseline" → skip G-DESIGN | G-DESIGN is non-negotiable; "quick" never waives it. The design note is the postmortem's frozen Method |
| Scaffold + implement in one turn before G-DESIGN | Inverts the contract. Code that lands before approval has no Motivation/Risks the user signed off on |
| Skipped evaluate-ml-pipeline because KFold(5) "feels right" | Even empty split_kwargs is a justified pick the skill exists to surface. Bypass = user never got the choice |
| Bootstrap mode → skip ALL questions, not just the sourcing menu | Bootstrap forbids the sourcing menu only. G-PKG-NAME / G-ENV-MGR / G-TABULAR / G-SKORE-MODE / G-EDA / G-DESIGN / G-CV-SPLITTER / G-RUN still fire |
| Ambiguous "hmm interesting" / "I guess" read as approval | Approval is explicit. Ambiguity → re-ask, never silent yes |
| Auto-detect run finished via reports/ mtime | § 4 is user-triggered (v1). The skill never auto-records |
| § 4 finishes recording → declare done, skip audit dispatch | § 4 audit dispatch is part of record-outcome, not optional. The audit digest carries the headline metrics for the JOURNAL row |
| Run experiment in same turn as G-RUN → declare done without § 4 | § 4 follows G-RUN in the same turn when the run completes successfully. Don't stop at "I ran it" — record the outcome |
| Pre-read every sibling SKILL.md file at session start | Read-set tracker is not a blocking gate. Open siblings just-in-time; emit pending list but proceed |
Compact checklist; Evidence-format spec in
references/preflight_evidence.md.
Pre-flight (iterate-ml-experiment):
- [ ] `journal/JOURNAL.md` read this turn (or confirmed missing → bootstrap)
Evidence: Read journal/JOURNAL.md (this turn) | "missing — bootstrap"
- [ ] `Workspace decisions` block checked for pre-recorded gates
Evidence: lists each <gate>: <value | not recorded>
- [ ] Mode: bootstrap | iterate-propose | iterate-record |
overview | compare | goal-pivot | abandoned | re-run
Evidence: rule that matched (Mode picker row)
- [ ] Last experiment + status: <NN_name> | n/a — bootstrap
Evidence: last row of JOURNAL.md History
- [ ] (Iterate-propose only) Sourcing menu presented; user picked
Evidence: AskUserQuestion id=<id>, answer=<skore|user|my-pick|B<N>>
| user free-text quote turn N
| "n/a — bootstrap / read-only mode"
- [ ] (Bootstrap only) Upfront config gates fired (G-PKG-NAME,
G-ENV-MGR, G-TABULAR, G-SKORE-MODE)
Evidence: per-gate ask id OR JOURNAL.md Status reference
| "n/a — iterate mode"
Note: G-CV-SPLITTER is NOT an upfront gate — it fires later, in
the § 3 chain at the evaluation step (after G-DESIGN).
- [ ] (Bootstrap only) G-EDA fired BEFORE the baseline draft
Evidence: explore-ml-data dispatched; answer=<run|skip>;
JOURNAL.md `## Data understanding (EDA)` section present
| "n/a — iterate mode"
- [ ] Design note drafted (or Backlog enriched, for `skore`)
Evidence: Write journal/<NN>_<name>.md (this turn) | "Backlog
rows B<x>..B<y> appended" | "n/a — read-only mode"
- [ ] G-DESIGN: user approved before any `experiments/NN_*.py` touched
Evidence: AskUserQuestion id=<id>, answer=approved | user quote |
"n/a"
- [ ] (§ 3 only) Three-skill chain ran in order:
build → evaluate → test
Evidence: each owning skill produced its file this turn
| "n/a outside § 3"
- [ ] (§ 3 only) G-CV-SPLITTER resolved during the evaluate step
Evidence: evaluate-ml-pipeline fired the splitter AskUserQuestion
(or mapped split_kwargs) before `evaluate.py` write
| "n/a outside § 3"
- [ ] (§ 3 only) G-RUN resolved: run now | leave for later
Evidence: AskUserQuestion id=<id> | "n/a outside § 3"
- [ ] (§ 4 only) All artifacts written: Status block + JOURNAL row +
Backlog hygiene + audit dispatch
Evidence: list each artifact written | "n/a outside § 4"
- [ ] python-api consulted for any new external symbol
Evidence: Read/Write scratch/api/<lib>/<v>/<topic>.md (this turn)
| "n/a — only re-using cached symbols"
- [ ] Pre-flight re-emitted with evidence before final message.
Evidence: this checklist appears in the end-of-turn summary.
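
The post-hoc audit ("walk every pre-flight row; surface unfilled Evidence cells") can be pictured as a simple filter. A minimal sketch, assuming each row is held as a label plus an evidence string; `PreflightRow` and `unfilled_rows` are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass
class PreflightRow:
    """One checklist row; `evidence` stays empty until something backs it."""

    label: str
    evidence: str = ""


def unfilled_rows(rows: list[PreflightRow]) -> list[str]:
    """Return labels whose Evidence cell is blank.

    An explicit "n/a ..." entry counts as filled, since the checklist
    allows it as evidence.
    """
    return [row.label for row in rows if not row.evidence.strip()]


rows = [
    PreflightRow("JOURNAL.md read this turn", "Read journal/JOURNAL.md (this turn)"),
    PreflightRow("Sourcing menu presented", "n/a, bootstrap"),
    PreflightRow("G-DESIGN approved"),
]
print(unfilled_rows(rows))
```

Anything this returns is what the end-of-turn summary should call out explicitly rather than silently tick.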
Workspace is in bootstrap mode when journal/JOURNAL.md is missing,
placeholder, or has 0 History rows.
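
A rough sketch of that check, assuming JOURNAL.md keeps its History as a Markdown table under a `## History` heading (the real template layout may differ). `count_history_rows` and `is_bootstrap` are hypothetical helpers, and a placeholder file is treated here simply as one with no History rows.

```python
from pathlib import Path


def count_history_rows(journal_text: str) -> int:
    """Count table body rows under a '## History' heading (assumed layout)."""
    table_lines = 0
    in_history = False
    for line in journal_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("## "):
            in_history = stripped == "## History"
            continue
        if in_history and stripped.startswith("|"):
            table_lines += 1
    # Drop the header row and the |---| separator row.
    return max(table_lines - 2, 0)


def is_bootstrap(journal_path: Path) -> bool:
    """True when JOURNAL.md is missing or carries no History rows."""
    if not journal_path.exists():
        return True
    return count_history_rows(journal_path.read_text(encoding="utf-8")) == 0
```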
Procedure (compact — full version in references/bootstrap.md):
- Missing src/ / experiments/ /
  journal/ → hand off to organize-ml-workspace, return when
  the placeholder JOURNAL.md exists.
- Copy JOURNAL.md from templates/JOURNAL.md.
- Read data/README.md before
  asking. Propose one sentence; user confirms or amends.
- G-EDA: dispatch explore-ml-data. The gate is binary (run / skip); on
  run it executes data/eda.py, writes data/eda.md + HTML, and
  fills the ## Data understanding (EDA) JOURNAL section. The
  findings (target balance / skew, datetime / group columns,
  missingness, cardinality) feed the next step's learner and metric
  defaults and inform the CV strategy chosen later at the evaluation
  step. The run path needs the agent feature (ipython) and may
  trigger G-AGENT-FEATURE here, before the baseline; if the user
  declines it, EDA falls back to skip. On skip, the JOURNAL
  section records Status: skipped.
- Draft journal/01_baseline.md via the consultation
  chain, informed by the EDA findings: learner default
  (build-ml-pipeline) and metric default (python-api on
  skore.evaluate). Do NOT fix a splitter here: the
  cross-validation strategy is data-driven and decided at the
  evaluation step (G-CV-SPLITTER, owned by evaluate-ml-pipeline)
  once the pipeline's X-marker exists; the note simply records that it
  is decided then. Conflicts with the EDA findings or the goal → flag
  in Risks, don't override.

Skipped: sourcing menu, § 1 resume/record/propose pick.
Still fires:
| Gate ID | Picks | Owner | Fires |
|---|---|---|---|
| G-PKG-NAME | src/<pkg>/ import name | organize-ml-workspace | before manifest creation |
| G-ENV-MGR | Env manager | python-env-manager | before any install command |
| G-TABULAR | Tabular library (pandas / polars) | data-science-python-stack | before data.py write |
| G-SKORE-MODE | Skore Project mode (local / hub / mlflow) + hub workspace name or MLflow tracking URI | organize-ml-workspace | before pyproject.toml write |
| G-EDA | Explore the data (run / skip) before the baseline is designed | explore-ml-data | before the journal/01_baseline.md draft |
| G-AGENT-FEATURE | Install ipython + pyright (install / skip) | python-env-manager | conditional: when G-EDA = run and the agent feature isn't present (else first audit at § 4) |
| G-DESIGN | User approval of journal/01_baseline.md | this skill | before any src/<pkg>/ or experiments/ code, i.e. before the § 3 chain |
| G-CV-SPLITTER | CV family for skore.evaluate | evaluate-ml-pipeline | inside the § 3 chain, AFTER G-DESIGN: at the evaluate step, before evaluate.py write; mandatory even with empty split_kwargs |
| G-RUN | "run now" vs "leave for later" | this skill | before executing the experiment script |
Free-text "quick baseline" / "you pick" do NOT resolve any of
these — fall through to structured AskUserQuestion.
→ next: G-DESIGN, then § 3 implementation chain.
Read JOURNAL.md.
Summarize to the user in 2–3 lines: dataset, goal, last experiment + status, what's ripe in Backlog.
Ask via AskUserQuestion — three options, no silent default:
done or abandoned; → § 2.

Free-text "let's keep going" / "yeah" is ambiguous: wait.
Every time § 2 runs in iterate mode, surface this menu with the JOURNAL.md Backlog table. Never silently default.
How would you like me to source the next experiment?
skore — read the audit digest at scratch/audit/<stem>/audit.md
from the latest run; follow each surfaced check's
documentation_url to draft a Backlog row, summarize,
re-present this menu.
user — you tell me what to try: article URL, GitHub issue,
spec / reference repo, or free text.
my-pick — I synthesize 2–4 candidate ideas; you pick one.
B<N> — promote a Backlog row directly.
Backlog (pick by index):
<paste JOURNAL.md Backlog table here>
Use AskUserQuestion for the pick. Plain-text enumeration only if
unavailable.
| User said… | Resolves to |
|---|---|
| Exact label (skore / user / my-pick / B<N>) | that pick |
| B2 / "let's do B2" | B<N> pick |
| Scientific article URL pasted | user → article-link branch |
| GitHub issue URL / org/repo#N / spec path | user → resource-link branch |
| "give me ideas" / "you decide" | my-pick |
| "let me try X" / "use Y instead" | user → free-text branch |
| Ambiguous / off-menu | fire AskUserQuestion, don't guess |
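
The resolution table above could be approximated in code as follows. This is a sketch under assumptions, not the skill's logic: `resolve_sourcing` and its regular expressions are hypothetical, it returns only the top-level pick (not the article / resource / free-text branch), and it does not try to recognize spec paths.

```python
import re

EXACT_LABELS = {"skore", "user", "my-pick"}


def resolve_sourcing(reply: str) -> str:
    """Map a free-text reply to a sourcing pick, or "ask" when unclear."""
    text = reply.strip().lower()
    if text in EXACT_LABELS:
        return text
    # A pasted URL or an org/repo#N reference means the user brings the source.
    if re.search(r"https?://|\b[\w.-]+/[\w.-]+#\d+\b", text):
        return "user"
    # "B2" / "let's do B2" promotes a Backlog row.
    backlog = re.search(r"\bb(\d+)\b", text)
    if backlog:
        return f"B{backlog.group(1)}"
    if "give me ideas" in text or "you decide" in text:
        return "my-pick"
    if text.startswith("let me try") or re.match(r"use .+ instead", text):
        return "user"
    return "ask"  # ambiguous or off-menu: fire AskUserQuestion, don't guess


print(resolve_sourcing("let's do B2"))
```

Returning a distinct `"ask"` value keeps the ambiguous case explicit instead of letting it collapse into a default pick.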
- skore → dispatch to iterate-from-skore. Returns
  Backlog-candidate rows + summary. Write rows with stable B<N>,
  surface summary, re-present sourcing menu. No design note
  this turn.
- user → dispatch to iterate-from-user. Returns a Proposal.
  Draft into journal/NN_short_name.md.
- my-pick → handled inline. Read JOURNAL.md Status, last
  Implication / Risks, current Backlog. Synthesize 2–4 candidates,
  present via AskUserQuestion. Draft the design note on pick.
- B<N> → promote the row. The row's Item becomes the seed;
  the row's Source becomes Sourcing strategy. Remove from
  Backlog on approval.

For user / my-pick / B<N>: write draft to
journal/NN_short_name.md using templates/experiment_design.md.
NN is the next free integer; short_name is the user's call.
→ next: § 3.
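
One way to compute NN for a new design note is sketched below. It assumes "next free integer" means one past the highest number already used in journal/ and that numbers are zero-padded to two digits, as in 01_baseline.md; `next_experiment_number` is a hypothetical helper.

```python
import re
from typing import Iterable


def next_experiment_number(note_names: Iterable[str]) -> str:
    """Return the next NN prefix given the file names found in journal/."""
    used = [
        int(match.group(1))
        for name in note_names
        # Only NN_<short_name>.md notes count; JOURNAL.md is skipped.
        if (match := re.match(r"(\d+)_.+\.md$", name))
    ]
    return f"{max(used, default=0) + 1:02d}"


print(next_experiment_number(["JOURNAL.md", "01_baseline.md", "03_gbdt.md"]))
```

Going one past the maximum, rather than filling gaps, avoids reusing the number of an abandoned experiment whose note still exists.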
Surface the draft: file path + 3–5 line summary (Question / Method / Risks).
Mid-iteration feedback is free-text. Edit journal/NN_*.md
in place; loop here.
Final approval gate is AskUserQuestion with two options:
Clear free-text "approved" / "go" / "looks good" resolves; ambiguous → structured ask.
Do not create experiments/NN_*.py during design iteration.
Track provenance honestly. Risks-only edits keep the original
Sourcing strategy. Method changes → <original> + user override.
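
The free-text side of the approval gate can be sketched as an exact-match check. The accepted phrases come from the rules above; the normalization (trimming, lowercasing, dropping trailing punctuation) and the name `is_explicit_approval` are assumptions of this sketch.

```python
APPROVALS = {"approved", "yes", "go", "looks good"}


def is_explicit_approval(reply: str) -> bool:
    """True only when the whole reply is one of the clear approval phrases."""
    normalized = reply.strip().lower().rstrip(".!")
    return normalized in APPROVALS


for reply in ("Looks good!", "I guess", "yes, but change the metric"):
    print(reply, "->", is_explicit_approval(reply))
```

Matching the whole reply, not a substring, means "yes, but change the metric" is treated as mid-iteration feedback and anything else falls through to the structured ask.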
After G-DESIGN passes, dispatch in order:
- build-ml-pipeline → src/<pkg>/{pipeline,features,data}.py.
- evaluate-ml-pipeline → src/<pkg>/evaluate.py. Owns the
  CV-strategy via AskUserQuestion. Bypassing is the named
  forbidden shortcut.
- test-ml-pipeline → smoke-test-ml-pipeline → matching smoke
  test at tests/smoke/test_NN_<short_name>.py.

Only then assemble experiments/NN_*.py. Confirm signatures via
python-api, not memory.
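
The ordering constraint of this chain can be illustrated with a small guard. A sketch only: `CHAIN` and `next_step` are hypothetical, and the sibling skills are represented by their names rather than invoked.

```python
CHAIN = ["build-ml-pipeline", "evaluate-ml-pipeline", "test-ml-pipeline"]


def next_step(design_approved: bool, completed: list[str]) -> str:
    """Return what must happen next in the implementation chain."""
    if not design_approved:
        return "G-DESIGN"  # no code before the design note is approved
    if completed != CHAIN[: len(completed)]:
        raise ValueError(f"chain ran out of order: {completed}")
    if len(completed) < len(CHAIN):
        return CHAIN[len(completed)]
    return "assemble experiments/NN_*.py"


print(next_step(True, ["build-ml-pipeline"]))
```

Raising on an out-of-order list makes the forbidden shortcut (writing evaluate.py without evaluate-ml-pipeline) visible instead of silently accepted.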
Once tests/smoke/ passes (the new test AND every prior one):
ask via AskUserQuestion:
- run now: pixi run python experiments/NN_<short_name>.py.
- leave for later.

No silent default.
→ next: if the run completed in this turn, continue immediately to § 4. Don't stop at "I ran it" — record the outcome.
Trigger: user says "the run finished" / "log it", OR the
agent ran the experiment in the same turn (G-RUN = run now) and it
completed successfully. Do NOT auto-detect via reports/ mtime
or polling for runs the user kicked off themselves.
- Dispatch audit-ml-pipeline to place +
  execute audit/NN_<short_name>.py. The audit reads the report
  read-only via the bundled runner and streams a markdown digest
  that carries the headline metrics. The audit replaces scratch
  probes: don't write scratch/<ts>_inspect_*.py to extract
  metrics from the report when the audit is the canonical path.
  Agent feature not present → python-env-manager § Agent feature
  (G-AGENT-FEATURE).
- Update the Status block in journal/NN_*.md:
  - Status: done (or abandoned with one-line reason).
  - Headline metric (e.g. RMSE 0.083 ± 0.004 (5-fold CV)).
- Before flipping to done: all tests/smoke/
  must pass. Prior failures = reproducibility regression → route
  to build-ml-pipeline § Reproducibility. The CV report can
  still land in skore Project, but the JOURNAL row stays
  approved until full smoke suite is green. Abandonment doesn't
  require passing smoke.
- Append the row to JOURNAL.md History.
- Backlog hygiene (e.g. ~~old~~ — resolved in NN_X).
  Diagnostic mining of the new report is iterate-from-skore's
  job, not § 4's.
- If the Source is a GitHub issue, ask via AskUserQuestion whether to
  gh issue comment <N> with the headline. Never auto-post.

Stop here. Do NOT auto-propose the next experiment in the same
turn. Surface the implication, ask via AskUserQuestion:
The user controls cadence; this skill records, it doesn't propose-and-record in one breath.
| Situation | Action |
|---|---|
| No prior experiment (bootstrap) | § 0 forces auto-drafted baseline. No strategy skill |
| User names a Backlog row (B2, "let's do B5") | Promote directly; no strategy skill |
| "mine the report" / "what does skore see?" | iterate-from-skore — enriches Backlog, re-shows menu. No design note this turn. |
| "I want to try X" / article URL / GitHub issue | iterate-from-user — three-branch ask. If free-text already resolved, pass pre-resolved branch |
| "give me ideas" / "you decide" | my-pick — handled inline. Synthesize 2–4 candidates, AskUserQuestion |
| Open-ended "what's next?" with ≥1 recorded experiment | Present sourcing menu verbatim + Backlog. No silent default |
The strategy skills are intentionally shallow: they source, this
skill drafts. The skore strategy requires a prior experiment
with an on-disk report — bootstrap can't use it.
If iterate-from-skore returns zero candidates: append a
one-liner to JOURNAL Status (Audit checks clean on <stem> as of <date> or Audit digest inaccessible on <stem> as of <date>).
No History row. Re-present sourcing menu.
Each is read-only or rare. Full procedures in
references/maintenance_modes.md:
- Abandoned experiments: AskUserQuestion(abandon / defer
  / run now). Status becomes abandoned with one-line reason.
- Re-runs: NN_<stem>_rerun or batch
  (NN_paired_comparison). New design note; original notes
  unchanged.

journal/
├── JOURNAL.md # status + history + backlog (index)
├── 01_baseline.md # design note for experiments/01_baseline.py
├── 02_<short_name>.md
└── …
Pairing rule (hard, four-way): journal/NN_<short_name>.md ↔
experiments/NN_<short_name>.py ↔
tests/smoke/test_NN_<short_name>.py ↔
audit/NN_<short_name>.py, identical stems, 1:1.
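
The four-way pairing can be expressed as a path mapping plus a check for missing members. A minimal sketch: `paired_paths` and `unpaired` are hypothetical helpers, and the check works on a set of relative path strings rather than touching the file system.

```python
from pathlib import PurePosixPath


def paired_paths(stem: str) -> dict[str, PurePosixPath]:
    """Return the four files that must share the stem NN_<short_name>."""
    return {
        "design_note": PurePosixPath("journal") / f"{stem}.md",
        "experiment": PurePosixPath("experiments") / f"{stem}.py",
        "smoke_test": PurePosixPath("tests/smoke") / f"test_{stem}.py",
        "audit": PurePosixPath("audit") / f"{stem}.py",
    }


def unpaired(existing: set[str], stem: str) -> list[str]:
    """List the paired files for `stem` that are absent from `existing`."""
    return [str(path) for path in paired_paths(stem).values() if str(path) not in existing]


present = {"journal/02_target_encoding.md", "experiments/02_target_encoding.py"}
print(unpaired(present, "02_target_encoding"))
```

A non-empty result is not always an error: the design note legitimately exists before the script, and the audit file only appears when the outcome is recorded.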
JOURNAL.md shape

- Status
- Data understanding (EDA): data/eda.md. Owned by explore-ml-data (written at the G-EDA
  bootstrap step); this skill only reserves the section.
- History
- Backlog: #,
  Item, Source (skore:<stem> / my-pick:<stem> / user).

Template: templates/JOURNAL.md. These four are the only sanctioned
sections; don't invent others.
Template: templates/experiment_design.md. Sections:
- build-ml-pipeline / evaluate-ml-pipeline.
- Status: planned → approved → done | abandoned.

No "Success criteria" section. The user judges post-run.
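
The status lifecycle (planned → approved → done | abandoned) together with the smoke-suite rule from § 4 can be sketched as a tiny state machine. Assumptions are flagged: `ALLOWED` and `advance` are hypothetical, and allowing abandonment straight from planned is a guess the source does not spell out.

```python
# Legal transitions; abandoning from "planned" is an assumption of this sketch.
ALLOWED = {
    "planned": {"approved", "abandoned"},
    "approved": {"done", "abandoned"},
    "done": set(),
    "abandoned": set(),
}


def advance(current: str, target: str, smoke_green: bool = False) -> str:
    """Return the new status, holding at `current` when smoke is not green."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    # The row stays approved until the full smoke suite passes;
    # abandonment does not require passing smoke.
    if target == "done" and not smoke_green:
        return current
    return target


print(advance("approved", "done", smoke_green=False))
```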
- (explore-ml-data owns the G-EDA
  step and the ## Data understanding (EDA) section).
- (evaluate-ml-pipeline +
  python-api).
- pipeline.py / features.py / data.py
  (build-ml-pipeline).
- (organize-ml-workspace).

| Skill | Relationship |
|---|---|
| organize-ml-workspace | Scaffold + stem-pairing rule |
| explore-ml-data | § 0 fires G-EDA before the baseline; the EDA findings seed the baseline note's Method / Risks and the ## Data understanding (EDA) JOURNAL section |
| iterate-from-user | User-sourced proposals (article / resource / free text) |
| iterate-from-skore | Report-sourced Backlog enrichment |
| build-ml-pipeline | pipeline.py / features.py / data.py body; reproducibility mechanics |
| evaluate-ml-pipeline | evaluate.py body; CV-strategy decision; report inspection |
| test-ml-pipeline → smoke-test-ml-pipeline | Smoke-test body; § 4 won't flip done until smoke is green |
| audit-ml-pipeline | § 4 dispatch; audit digest carries the headline metrics for the JOURNAL row |
| python-api | Signature lookups |
| python-env-manager | G-AGENT-FEATURE for audit AND explore-ml-data (EDA) prerequisites |
- references/bootstrap.md: full bootstrap procedure, config-gate
  details, baseline-template substitution.
- references/record_outcome.md: full § 4 procedure with Backlog
  hygiene examples, GitHub comment template.
- references/maintenance_modes.md: overview / compare /
  goal-pivot / abandoned / re-runs with full procedures.
- references/preflight_evidence.md: Evidence-format spec.
- templates/JOURNAL.md: four-section index skeleton.
- templates/experiment_design.md: design note with Status block.

Copy, don't rewrite.