From probabl-skills
Runs an EDA script to surface dataset shape, dtypes, missingness, cardinality, target balance, and feature associations before model design. Produces a persisted report and JOURNAL section that justify later learner/splitter/metric choices.
How this skill is triggered — by the user, by Claude, or both
Slash command
/probabl-skills:explore-ml-data
The summary Claude sees in its skill listing, used to decide when to auto-load this skill
Understand the dataset before designing a model. One project-level
EDA per workspace: an executable data/eda.py, a persisted
data/eda.md narrative, rich data/eda_<table>.html reports, and a
short JOURNAL section that links them. The findings feed the baseline
design note's learner / splitter / metric choices.
| You came here for… | → next |
|---|---|
| Bootstrap, before the first baseline | → back to iterate-ml-experiment § 0; the EDA findings inform the auto-drafted 01_baseline.md |
| User free-text ("explore the data") | → surface the findings; no further dispatch unless the user asks to model |
| Re-understand a changed data source | → re-run, overwrite data/eda.*, refresh the JOURNAL EDA section |
Always re-emit the Pre-flight checklist with evidence before declaring the turn done.
EDA is a bootstrap-time gate (G-EDA) owned by this skill and
fired by iterate-ml-experiment § 0 before the baseline design
note. Ordering matters: the dataset facts (class balance, datetime /
group columns, missingness, cardinality) are exactly what justifies
the splitter (G-CV-SPLITTER), the metric default, and the learner
default. Running EDA after the model is designed defeats the purpose.
scaffold → JOURNAL → goal from data/README.md
   │
   └─► G-EDA (run | skip) ◄── this skill
         │ run
         └─► data/eda.py → execute → data/eda.md + HTML + JOURNAL §EDA
               │
               └─► auto-draft 01_baseline.md (cites the EDA findings)
Two locations are kept separate: the raw data source (read-only,
may live anywhere) and the EDA deliverables (always under
<project>/data/).
| Path | Durability | Who writes it | What it holds |
|---|---|---|---|
| raw data source (data/, raw/, an absolute path, external) | user-owned, READ-ONLY | the user | The dataset. EDA reads it; never modifies it. May be anywhere; not assumed to be data/ |
| data/eda.py | Durable (committed) | This skill, once per workspace | The jupytext # %% EDA cells. Source of truth. Openable as a notebook for the rich view |
| data/eda.md | Durable (committed) | This skill (authored from the digest) | The prose narrative: findings + modelling implications that the baseline note cites |
| data/eda_<table>.html | Durable (committed) | data/eda.py via TableReport.write_html(...) | The rich, interactive skrub report per table, for the human |
| scratch/eda/eda.md | Ephemeral (gitignored), optional | run_cells.py when given a 2nd arg | Per-cell digest the agent reads. Same content as stdout |
| journal/JOURNAL.md § Data understanding (EDA) | Durable (committed) | This skill | 2–4 line summary + link to data/eda.md |
Mnemonic: the raw data is read-only and lives wherever the user
keeps it; data/eda.py is source; data/eda.md + the HTML are the
durable deliverables, always under data/; scratch/eda/ and
stdout are the ephemeral run digest.
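The layout above amounts to a small allow-list of project-relative write targets. The following is a minimal sketch of that idea, not part of the skill itself: the names `ALLOWED_WRITE_PATTERNS` and `is_allowed_write` are hypothetical, and the glob patterns are an assumption about how the `<table>` placeholder and the scratch folder would be matched.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Project-relative targets taken from the file map above.
# Hypothetical helper names; the patterns are an illustrative assumption.
ALLOWED_WRITE_PATTERNS = (
    "data/eda.py",
    "data/eda.md",
    "data/eda_*.html",
    "scratch/eda/*",
    "journal/JOURNAL.md",  # only the EDA section of this file is owned here
)


def is_allowed_write(relative_path: str) -> bool:
    """Return True if a project-relative path is an EDA deliverable or digest."""
    path = PurePosixPath(relative_path)
    # Reject absolute paths and parent-directory escapes before matching.
    if path.is_absolute() or ".." in path.parts:
        return False
    normalized = path.as_posix()
    return any(fnmatchcase(normalized, pattern) for pattern in ALLOWED_WRITE_PATTERNS)
```

Anything under data/ that is not an eda.* deliverable (for instance a raw file the user keeps there) falls outside the list, which matches the read-only rule for raw data.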
The central rule. Surfaced as the first Stop condition below.
Allowed: this skill writes ONLY the following (deliverables always under
<project>/data/, created if absent):
- data/eda.py: the EDA script (created / overwritten in place).
- data/eda.md: the authored narrative.
- data/eda_<table>.html: the skrub TableReport pages.
- scratch/eda/: the ephemeral digest.
- journal/JOURNAL.md § Data understanding (EDA).

Forbidden:
- Modifying the raw data files (wherever they live: data/, another
folder, an absolute/external path). EDA reads them; it never
rewrites them. Data cleaning is the pipeline's job
(build-ml-pipeline), declared at fit time, not a one-off mutation.
- Writes outside the allowed set: no src/<pkg>/
edits, no reports/ writes, no new experiment files.
- Modelling: no skore.evaluate(...), no project.put(...),
no learner selection here. EDA informs those; it does not make
them.

Stop conditions:
- Raw data is read-only. data/eda.py reads the raw files
(wherever they live) and writes only the data/eda.* deliverables.
- Deliverables go under <project>/data/; the raw source is
separate. Write data/eda.py / data/eda.md /
data/eda_<table>.html under <project>/data/ (create the folder
if absent). The raw data the script reads may live anywhere
(data/, another in-repo folder, an absolute or external path), so
decouple the two: a RAW = <LOAD_RAW_DATA> source vs an EDA_DIR
output. Never assume the raw data is in data/.
- G-EDA must be resolved before journal/01_baseline.md is drafted. It is binary:
run (place + execute data/eda.py, write the deliverables) or
skip (record Status: skipped — <date> in the JOURNAL section
and proceed). Do not silently bypass; fire the AskUserQuestion.
Free-text "go fast" / "quick baseline" does NOT resolve it.
- The run path needs ipython. If it is missing and the user chose run, STOP and
delegate to python-env-manager § "Agent feature"
(G-AGENT-FEATURE). Do NOT type pixi add ... ipython yourself;
do NOT fabricate EDA output with hand-written print()s. If the
user declines the agent feature, fall back to the skip path
(record Status: skipped); never loop between run and install.
- Every skrub / pandas /
polars symbol (TableReport, TableReport.json, write_html,
column_associations, the tabular reader, …) must come from
python-api this turn. Cache hits under
scratch/api/<lib>/<version>/ count; inline memory does not.
TableReport.json()'s key names are not formally documented and
drift across skrub versions; confirm them via python-api and
parse defensively (.get(...)).
- Keep the summaries library-agnostic: no pandas- or polars-specific
methods (select_dtypes doesn't even exist in polars). The
structured facts come from skrub (TableReport(...).json(),
column_associations), which accepts both libraries. The ONLY library-specific
line is RAW = <LOAD_RAW_DATA>. Do not write
df.isna()/df.nunique()/df.select_dtypes(...) etc.
- Use skrub.TableReport for dataframe overviews. Every table gets a
TableReport(RAW, title=..., verbose=0) written to
data/eda_<table>.html (the user-facing artifact) AND read via
.json() for the digest. verbose=0 keeps progress prints out of
the digest.
- Never end a cell on a bare TableReport. Outside a notebook,
repr(TableReport(df)) is the useless <TableReport: use .open() to display>. Use report.write_html(...) (a statement) for the
HTML, and end cells on text-friendly expressions (RAW.shape,
a dict/list built from report.json(),
skrub.column_associations(RAW)) so the digest carries real
values. Mirrors audit's .frame() rule.
- Never gitignore the whole data/; ask about the inputs. The
deliverables live in data/ and must stay committable, so the
whole data/ folder must never be in .gitignore. If the raw
inputs should be kept out of git (large / local-only), fire an
AskUserQuestion offering to ignore specific input patterns
(e.g. data/raw/, data/*.parquet); the default is not to ignore anything. Then verify
the deliverables are tracked (git check-ignore data/eda.md must
return nothing). Never auto-edit .gitignore; that file is
organize-ml-workspace's to write, so surface the patch and ask.
- One EDA per workspace. data/eda.py covers the whole
dataset; multi-table data gets one TableReport cell per table
inside that one file (run the target/structure cells on the
target-bearing table). No eda_v2.py, no per-experiment EDA files,
not part of the four-way stem pairing. Re-understanding overwrites
data/eda.py in place.
- Implications, not decisions, go in data/eda.md; the picks
happen in their owning gates (G-CV-SPLITTER, the baseline note).

| Shortcut | Why it's wrong |
|---|---|
| Design the baseline first, EDA "later if there's time" | Inverts G-EDA. The point is to justify the modelling choices before making them. EDA runs first in bootstrap |
| End a cell on a bare TableReport(df) to "show the report" | Outside a notebook that repr is <TableReport: use .open() to display>, which puts zero signal in the digest. Use write_html(...) + a text summary built from report.json() |
| print(...) instead of a bare summary expression | The runner captures bare last-expressions via result.result; print(...) lands in stdout and is harder to scan. Use bare expressions |
| Use pandas/polars methods (df.isna(), df.nunique(), df.select_dtypes(...)) for the summaries | Breaks on the other library (polars has no select_dtypes). Read the facts off skrub (TableReport(...).json(), column_associations), which is agnostic to pandas/polars |
| Clean / impute / drop columns in data/eda.py and re-save the raw file | EDA is read-only against raw data. Cleaning belongs in the pipeline (build-ml-pipeline), applied at fit time for train/test consistency |
| Assume the raw data is in data/ | The raw source may live anywhere; only the deliverables are pinned to data/. Set RAW = <LOAD_RAW_DATA> to wherever the data actually is |
| Gitignore the whole data/ folder | The committed deliverables (data/eda.*) live there. Ignore only specific input patterns, and ask the user first |
| Run EDA without the agent feature by hand-writing the expected output | Fabricated EDA is worse than none. Missing runner → G-AGENT-FEATURE (install) or the skip path |
| pixi add ipython directly from this skill | Install is owned by python-env-manager. This skill requests via G-AGENT-FEATURE |
| Drop the authored data/eda.md and leave only the HTML | The .md carries the modelling implications the baseline note cites and the JOURNAL section links. Both are required |
| Invent column meanings not visible in the data | Report what the data shows. Domain semantics the user didn't state go in an explicit "open questions" list, not as asserted fact |
| Forget the JOURNAL § Data understanding update | The section is the index entry; without it later sessions can't find the EDA. It is part of "done" |
Pre-flight (explore-ml-data):
- [ ] Trigger: bootstrap-G-EDA | user-request | data-changed
Evidence: caller + rule that matched
- [ ] Detection: EDA already present? data/eda.md + JOURNAL §EDA
Evidence: ls / Glob on data/eda.md + Read JOURNAL §EDA
| "n/a — first EDA"
- [ ] G-EDA resolved: run | skip
Evidence: AskUserQuestion id=<id>, answer=<run|skip>
| user free-text quote turn N
If skip: JOURNAL §EDA records "Status: skipped — <date>"; STOP here.
- [ ] Tabular library known (G-TABULAR): pandas | polars
Evidence: JOURNAL.md Status (Workspace decisions) | AskUserQuestion
via data-science-python-stack
- [ ] Raw data located (may be outside data/): <paths / loader>
Evidence: ls / Glob on the data location + the RAW load call placed
in data/eda.py | user-quoted path turn N
- [ ] data/ not gitignored as a whole; deliverables will be tracked
Evidence: `git check-ignore data/eda.md` returns nothing
| AskUserQuestion id=<id> on ignoring specific inputs
| "n/a — no .gitignore yet"
- [ ] Agent feature available (run path only):
`pixi run -e agent ipython -c "print(0)"` exit 0
Evidence: tool output | JOURNAL.md Status `agent feature: installed`
Missing → STOP, delegate to python-env-manager G-AGENT-FEATURE
(decline → fall back to skip path)
- [ ] python-api consulted for symbols used:
skrub.TableReport, TableReport.write_html, TableReport.json,
skrub.column_associations, the tabular reader (load cell only)
Evidence: Read/Write scratch/api/<lib>/<version>/<topic>.md (this turn)
| "n/a — cache hit + Read this turn"
- [ ] Template copy + substitution decided:
<pkg> → package name from src/<pkg>/
<LOAD_RAW_DATA> → the real loader, pointing wherever the data lives
<TARGET_COLUMN> → the target (from goal / data/README.md), or n/a
<table> → short slug per table for eda_<table>.html
Evidence: Read templates/eda.py this turn before Write data/eda.py
- [ ] Execution command shape confirmed:
pixi run -e agent python \
.agents/skills/audit-ml-pipeline/scripts/run_cells.py \
data/eda.py [scratch/eda/eda.md]
Evidence: command emitted before running
- [ ] Deliverables written: data/eda.md (prose + implications),
data/eda_<table>.html (≥1), JOURNAL §Data understanding
Evidence: Write of each | "n/a — skip path"
- [ ] Pre-flight re-emitted with evidence before final message.
Evidence: this checklist appears in the end-of-turn summary.
data/eda.py is jupytext percent format (# %%), executed by
the shared runner. Template: templates/eda.py. Full cell-by-cell
anatomy with right / wrong shapes: → references/cell_anatomy.md.
| Placeholder | Replaced with |
|---|---|
| <pkg> | The importable package name (from src/<pkg>/); used for from <pkg> import PROJECT_ROOT (only to locate EDA_DIR = PROJECT_ROOT / "data") |
| <LOAD_RAW_DATA> | The real load of the raw file(s), pointing wherever the data lives (in data/, another folder, an absolute path, or external). Uses the workspace tabular lib (pandas/polars); skrub accepts both. The one library-specific line |
| <TARGET_COLUMN> | The target column name (from the goal / data/README.md), or remove the target cell if unsupervised / unknown |
| <table> | A short slug per table for the HTML filename (eda_<table>.html); for a single table use the dataset name |
Brief outline; concrete examples → references/cell_anatomy.md.
- Setup: import json, import skrub,
from <pkg> import PROJECT_ROOT, EDA_DIR = PROJECT_ROOT / "data"
(+ EDA_DIR.mkdir(parents=True, exist_ok=True)). No pandas/polars
import here.
- Load: RAW = <LOAD_RAW_DATA>
pointing wherever the data lives; end on RAW.shape.
- Overview: report = skrub.TableReport(RAW, title=..., verbose=0); report.write_html(EDA_DIR / "eda_<table>.html"); then summary = json.loads(report.json())
and end on a dict/list of per-column dtype / null / cardinality
facts. One such cell per table.
- Target: read the <TARGET_COLUMN> entry from summary["columns"]; it carries value counts
(classification) or a distribution summary (regression). Drives the
metric default and whether the splitter should stratify.
- Structure: datetime / group columns, which drive the G-CV-SPLITTER choice
(TimeSeriesSplit / GroupKFold).
- Associations: skrub.column_associations(RAW) to flag strong predictors and
possible leakage.
- Authoring: write data/eda.md + the JOURNAL section from this digest.

write_html(...) is load-bearing on the overview cells (the human
artifact). verbose=0 and the bare report.json()-derived
expressions are load-bearing for a clean, library-agnostic digest.
For multi-table data, run cells 5–7 on the target-bearing table; for
very large data, load a row sample (see references/cell_anatomy.md).
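The defensive-parsing habit asked for on the overview and target cells can be sketched in a few lines. This is an illustration under stated assumptions, not the template's code: the function name `column_facts` is hypothetical, and every key name (`columns`, `name`, `dtype`, `null_count`, `n_unique`) is assumed for the example and should be confirmed via python-api for the installed skrub version. A stand-in JSON string is used so the sketch runs without skrub; in data/eda.py the string would come from report.json().

```python
import json


def column_facts(report_json: str) -> list:
    """Pull per-column dtype / null / cardinality facts from a report JSON string.

    Every lookup uses .get(...) so a renamed or missing key yields None
    instead of raising. The key names are assumptions, not a documented schema.
    """
    summary = json.loads(report_json)
    facts = []
    for column in summary.get("columns", []):
        facts.append(
            {
                "name": column.get("name"),
                "dtype": column.get("dtype"),
                "null_count": column.get("null_count"),
                "n_unique": column.get("n_unique"),
            }
        )
    return facts


# Stand-in payload (hypothetical values) in place of report.json().
example_json = json.dumps(
    {"columns": [{"name": "age", "dtype": "int64", "null_count": 3}]}
)
example_facts = column_facts(example_json)
```

In a cell, ending on the bare `example_facts` expression (rather than printing it) is what lets the runner capture the list in the digest.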
pixi run -e agent python \
.agents/skills/audit-ml-pipeline/scripts/run_cells.py \
data/eda.py
The runner (shared with audit-ml-pipeline) streams the digest to
stdout — the agent reads it directly from the bash tool output. Pass
a second arg scratch/eda/eda.md to also write the digest to a file.
For non-pixi workspaces, swap the activation prefix per
python-env-manager § "Agent feature".
This skill ships no runner of its own — there is no
explore-ml-data/scripts/. Always invoke the shared
audit-ml-pipeline/scripts/run_cells.py at the path above; don't
look for or fork a local copy.
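For callers that assemble the invocation programmatically, the command shape above can be expressed as an argument list. This is a minimal sketch: the helper name `build_run_command` is hypothetical, and the pixi prefix is hard-coded here even though non-pixi workspaces would use a different activation prefix.

```python
from typing import List, Optional

# Path of the shared runner, as given in the command above.
RUNNER = ".agents/skills/audit-ml-pipeline/scripts/run_cells.py"


def build_run_command(
    script: str = "data/eda.py", digest_path: Optional[str] = None
) -> List[str]:
    """Build the argv for the shared cell runner (pixi workspaces only)."""
    command = ["pixi", "run", "-e", "agent", "python", RUNNER, script]
    if digest_path is not None:
        # Optional second argument: also write the digest to this file.
        command.append(digest_path)
    return command
```

The list can be handed to `subprocess.run(...)`; passing `"scratch/eda/eda.md"` as `digest_path` mirrors the optional second argument described above.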
Prerequisites for the run path: the workspace package must be
importable (from <pkg> import PROJECT_ROOT — editable install done
during scaffold) and skrub installed (Tier 1). If either import
fails, the digest shows the ImportError; route to
python-env-manager for the missing piece rather than working around
it.
- When the data source changes: overwrite data/eda.py, re-run,
re-author data/eda.md + HTML, refresh the JOURNAL section.
- scratch/eda/ is overwritten on every run. The durable record is
data/eda.py + data/eda.md + git history.

data/eda.md

After the run, read the digest and write data/eda.md from
templates/eda.md. It is prose, grounded in the digest — no invented
facts. Required sections:
- Modelling implications, for example: "[…] StratifiedKFold + look at ROC-AUC / PR-AUC,
not accuracy"; "user_id repeats across rows → consider
GroupKFold"; "timestamp present → TimeSeriesSplit if forecasting".
These are implications, not decisions; the gates own the picks.

Link each data/eda_<table>.html from the relevant section.
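The mapping from dataset facts to implication sentences can be sketched as a plain function. Everything here is illustrative: the function name, the fact keys (`minority_class_share`, `repeated_id_columns`, `datetime_columns`) and the 0.2 imbalance threshold are assumptions, and treating class imbalance as the trigger for the StratifiedKFold hint is an inference from the metric advice above. The output is wording for data/eda.md only; the owning gates still make the picks.

```python
# Hypothetical threshold: below this minority share the target counts as imbalanced.
IMBALANCE_THRESHOLD = 0.2


def modelling_implications(facts: dict) -> list:
    """Turn EDA facts into hedged implication sentences (not decisions)."""
    implications = []

    minority_share = facts.get("minority_class_share")
    if minority_share is not None and minority_share < IMBALANCE_THRESHOLD:
        implications.append(
            "imbalanced target: consider StratifiedKFold and look at "
            "ROC-AUC / PR-AUC, not accuracy"
        )

    for column in facts.get("repeated_id_columns", []):
        implications.append(f"{column} repeats across rows: consider GroupKFold")

    for column in facts.get("datetime_columns", []):
        implications.append(
            f"{column} present: consider TimeSeriesSplit if forecasting"
        )

    return implications
```

An empty facts dict yields an empty list, so a dataset with nothing notable produces no implication lines rather than invented ones.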
iterate-ml-experiment's JOURNAL.md carries a top-level
## Data understanding (EDA) section (placed right after ## Status). This skill owns its content:
## Data understanding (EDA)
- **Status:** done — <YYYY-MM-DD> <!-- or: skipped — <YYYY-MM-DD> -->
- **Summary:** <2–4 lines: dataset shape, target balance/skew, the
one or two findings that most shape the modelling choices>
- **Report:** [data/eda.md](../data/eda.md)
Keep it to a few lines — it is an index entry, not the report. The
detail lives in data/eda.md. On the skip path, only the
Status: skipped line is required.
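A small renderer makes the two shapes of the section (done vs skipped) explicit. This is a sketch under assumptions: the function name `render_journal_section` is hypothetical, and the line layout is copied from the template above, including its dash separator after the status word.

```python
from datetime import date
from typing import Optional


def render_journal_section(
    status: str, day: date, summary: Optional[str] = None
) -> str:
    """Render the JOURNAL EDA section for the run ("done") or skip ("skipped") path."""
    if status not in ("done", "skipped"):
        raise ValueError(f"unknown status: {status!r}")
    if status == "done" and not summary:
        raise ValueError("the done path needs a 2-4 line summary")

    lines = [
        "## Data understanding (EDA)",
        # Separator copied verbatim from the template above.
        f"- **Status:** {status} — {day.isoformat()}",
    ]
    if status == "done":
        lines.append(f"- **Summary:** {summary}")
        lines.append("- **Report:** [data/eda.md](../data/eda.md)")
    # On the skip path only the Status line is required.
    return "\n".join(lines) + "\n"
```

Rejecting a "done" status without a summary keeps the section from being written as finished when the findings were never authored.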
| Caller | When |
|---|---|
| iterate-ml-experiment § 0 bootstrap | Automatic; G-EDA fires before the baseline design note |
| User free-text | "explore the data", "do an EDA", "profile the dataset" — resolves directly |
| Callee | Why |
|---|---|
| python-env-manager § Agent feature | When ipython is missing on the run path (G-AGENT-FEATURE) |
| python-api | Every skrub / pandas / polars symbol. Cache hits first |
| data-science-python-stack | G-TABULAR (pandas / polars) if not yet recorded; skrub TableReport reference |
| python-code-style | After writing data/eda.py: ruff format / check + contextualize the comments to this dataset (strip any leftover workflow/process prose) |
This skill does not:
- Do the modelling (that belongs to build-ml-pipeline /
evaluate-ml-pipeline / iterate-ml-experiment).
- Edit src/<pkg>/ or the experiment / audit files.
- Install ipython / pyright (python-env-manager owns).

| Skill | Relationship |
|---|---|
| iterate-ml-experiment | Caller. § 0 fires G-EDA before the baseline note; the EDA findings seed the note's Method / Risks |
| audit-ml-pipeline | Owns the shared cell runner scripts/run_cells.py this skill executes; same bare-expression discipline |
| organize-ml-workspace | Workspace layout; data/ is user-owned, and this skill is the one exception that writes data/eda.* into it |
| python-env-manager | Agent feature install (G-AGENT-FEATURE). This skill requests; that skill installs |
| python-api | skrub / pandas / polars symbol lookups. Cache hits first |
| data-science-python-stack | G-TABULAR; skrub TableReport is catalogued there |
| python-code-style | ruff after writing data/eda.py |
- templates/eda.py: the data/eda.py skeleton. Copy + substitute;
don't rewrite from memory.
- templates/eda.md: the data/eda.md report skeleton.

The cell runner is not owned here; it is
audit-ml-pipeline/scripts/run_cells.py (shared). Don't fork it.

- references/cell_anatomy.md: concrete cell examples (right /
wrong shapes), the TableReport repr trap, the full cell
sequence, and how each finding maps to a downstream gate.

npx claudepluginhub probabl-ai/skills --plugin probabl-skills