By probabl-ai
Run a structured ML experimentation loop: from EDA and pipeline declaration to evaluation, smoke testing, and audit-driven backlog generation. Use skrub DataOps graphs, skore reports, and a prescribed Python stack to keep experiments reproducible and documented.
Owns the `audit/` folder: one `# %%` (jupytext percent) Python file per experiment, aligned 1:1 with `experiments/NN_<short_name>.py` and `journal/NN_<short_name>.md`, that loads the experiment's skore report **read-only** and uses bare-last-expression cells whose `__repr__` carries the audit's signal. The agent executes the audit file via the bundled in-process runner (`audit-ml-pipeline/scripts/run_cells.py` — IPython `InteractiveShell.run_cell`), which streams a markdown digest of each cell's stdout + last-expression repr to stdout (optionally also to a file). The digest fuels narrative work (the `JOURNAL.md` Status + History update, follow-up questions about a past experiment, cross-experiment comparison). Stops at "audit/NN_*.py is placed, executed, and the digest is available." Never calls `skore.evaluate(...)` or `project.put(...)`. TRIGGER — any of: - `iterate-ml-experiment` § 4 record-outcome — audit is dispatched FIRST (replaces scratch probes for metric extraction). - The user asks "audit experiment 02", "show me what 03 looks like", "re-audit 04 against the new report". - An experiment was re-run (same `put()` key overwritten) and the matching audit file needs re-execution. - The user wants a human-readable narrative of a past experiment without firing the full `iterate-from-skore` flow. SKIP when: the design note isn't approved yet (route to `iterate-ml-experiment`); the experiment hasn't been run (no report on disk); the agent feature isn't installed (delegate to `python-env-manager` § "Agent feature"); the user is mining the report to source the *next* experiment (`iterate-from-skore`); the user wants to explore the **raw dataset** rather than a finished run's skore report (`explore-ml-data` — audit reads a report, not the data). 
HOW TO USE: confirm the four-way stem pairing exists (`journal/NN_*.md` approved + `experiments/NN_*.py` exists + smoke test passed + report under that key in the Project), then place `audit/NN_<short_name>.py` from `templates/audit.py`, substituting the package name + the literal Project init block copied from `experiments/<stem>.py`. Execute via the bundled runner: `pixi run -e agent python .agents/skills/audit-ml-pipeline/scripts/run_cells.py audit/<stem>.py`. **Read the Stop conditions and emit the Pre-flight checklist before any write or shell command.** Always invoke `python-api` for skore symbol signatures — never write them from memory.
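The file-level part of the stem pairing described above can be pictured with a small standard-library check. This is only a sketch: the function names `check_stem_files` and `ready_for_audit` and the example stem `02_baseline` are hypothetical, and the sketch does not verify journal approval, the smoke-test result, or the report stored in the skore Project.

```python
from pathlib import Path


def check_stem_files(root: Path, stem: str) -> dict[str, bool]:
    """Report which per-experiment files exist for one stem, e.g. "02_baseline".

    Only file presence is checked. Journal approval, the smoke-test result,
    and the report stored in the skore Project are NOT verified here.
    """
    return {
        "journal": (root / "journal" / f"{stem}.md").is_file(),
        "experiment": (root / "experiments" / f"{stem}.py").is_file(),
        "audit": (root / "audit" / f"{stem}.py").is_file(),
    }


def ready_for_audit(root: Path, stem: str) -> bool:
    """True when the journal note and the experiment script both exist."""
    status = check_stem_files(root, stem)
    return status["journal"] and status["experiment"]
```

A `False` under `"audit"` simply means the audit file has not been placed yet, which is the expected state before this skill runs.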
Declare the pipeline from data source to predictor as a **skrub DataOps graph** (not as a bare `sklearn.Pipeline`). Every step is either a pure-Python function (stateless) attached via `.skb.apply_func`, or a sklearn-compatible estimator (stateful) attached via `.skb.apply`. Stops at the declared object — no fit, split, tuning, persistence, or evaluation. TRIGGER — any of: - Writing or editing code that declares any link in the chain *data source → predictor*: loaders, preprocessing, encoders / imputers / scalers, feature steps, composition objects (`Pipeline`, `ColumnTransformer`, skrub `tabular_pipeline`, `nn.Module`), or the final estimator. - A pure-Python data-processing function destined for the pipeline path (cleans / derives / reshapes) — whether wrapped via `FunctionTransformer`, `skrub.@deferred` / `skrub.var`, a custom `BaseEstimator` subclass, or just called in the training path before the estimator. - A step is added, removed, swapped, or reordered inside an existing pipeline declaration. - A bare `sklearn.Pipeline` / `make_pipeline` is being used as the top-level — fire to redirect into a skrub DataOps graph. - The user asks to build / declare / set up a pipeline / classifier / regressor for X. SKIP when: `.fit(...)` calls / training loops / `Trainer.fit` / epoch loops; train/test split or cross-validation splitting; hyperparameter search; persistence (`joblib.dump`, checkpointing); evaluation / metrics / scoring; inference over a pre-trained model; pure EDA; library-choice questions with no concrete declaration in play. HOW TO USE: consult before the first declarative line and on every structural edit (added/swapped step, changed input columns, changed estimator family). Don't re-consult for cosmetic edits. **First, read the Stop conditions and emit the Pre-flight checklist as visible text before any code.** Always invoke `python-api` to confirm skrub / sklearn symbol names and signatures before typing — don't guess from memory.
Opinionated Python stack for data-science / ML work — one library per job, organized into tiers (mandatory / user choice / optional / transitive). SKILL.md is the index; per-library `references/<library>.md` files carry scope, "pick this when" / "pick something else when", and pairings. TRIGGER when (any of these): (1) **a library import fails** in this stack's domain — the answer is install, not substitute (see § "Missing dependency"); (2) **a library choice has to be made** — explicitly (the user asks "which library for X?") or implicitly (code is about to introduce a new dependency, or the project is being scaffolded and the tabular library hasn't been picked yet); (3) starting a new Python data-science / ML project; (4) the user or current code reaches for a substitute outside the stack (xgboost, lightgbm, black, isort, flake8, poetry, hatch), or reaches for `mlflow` to log params/metrics, or for `cross_val_score` + handwritten reporting — redirect: tracking → `skore` Project API, evaluation / reporting → `skore` report classes, `mlflow` stays only for model serving / registry. SKIP when: the project is non-Python; the work is web / backend / infra unrelated to data science; the library is already chosen and installed and the task is implementation inside it (bug fix, feature work, refactor) with no new dependency in play. HOW TO USE: **read this SKILL.md end-to-end before recommending or installing anything** — picking from a single index entry hides the tier (whether the library is mandatory, a user-choice, optional, or already transitively present) and the pairings, and both matter. Then read the linked `references/<library>.md` for the chosen library's scope and tradeoffs. Don't silently substitute one library for another; if no entry fits, surface the gap to the user.
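One way to picture trigger (4) is a small import scanner built on the standard-library `ast` module. This is a sketch under assumptions: the function name `find_out_of_stack_imports` and the hint strings are hypothetical, only three of the libraries named above are covered, and the skill itself may detect substitutes differently.

```python
import ast

# Library names come from the description above; the hint texts are illustrative.
OUT_OF_STACK = {
    "xgboost": "outside the stack: check the stack index before adding it",
    "lightgbm": "outside the stack: check the stack index before adding it",
    "mlflow": "tracking goes to the skore Project API; mlflow stays for serving / registry",
}


def find_out_of_stack_imports(source: str) -> list[tuple[int, str, str]]:
    """Return (line number, top-level module, hint) for each flagged import."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules = [node.module]
        else:
            continue
        for module in modules:
            top_level = module.split(".")[0]
            if top_level in OUT_OF_STACK:
                findings.append((node.lineno, top_level, OUT_OF_STACK[top_level]))
    # ast.walk is breadth-first, so sort to report findings in source order.
    return sorted(findings)
```

The scanner only flags imports; deciding what to use instead is left to the stack index and, where no entry fits, to the user.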
Methodology for evaluating a single sklearn-compatible learner (in particular, the `SkrubLearner` produced by `build-ml-pipeline`). Owns: which entry point to call (`skore.evaluate` first, the explicit report classes when needed), which cross-validator to pick from scikit-learn's catalogue, how to consume the structural metadata (`groups`, `times`, …) attached at build time via `.skb.mark_as_X(split_kwargs=...)`. Stops at "what does the report say". Defaults (metrics, plots) come from skore; only override on explicit user request. TRIGGER when: code calls `cross_val_score`, `cross_validate`, `classification_report`, or any handwritten metric print (`print(mean_squared_error(...))`); code calls `.skb.cross_validate(...)` (route through skore for richer output); user asks how to score, evaluate, or compare a single learner; user asks how to pick a cross-validator; user wants to see a report / metrics / diagnostic plots for a fitted learner. SKIP when: declaring the pipeline (use `build-ml-pipeline`); hyperparameter / model search (separate skill); fitting, persisting, or serving the final model; tracking or comparing experiments across multiple runs over time (separate skill). HOW TO USE: invoke before any evaluation call. **First, read the "Stop conditions" block at the top of the body and emit the Pre-flight checklist as visible text in your response — both are mandatory before any evaluation code is written.** The structural facts about the data (group keys, time ordering) should already be encoded at the X marker via `split_kwargs` — if they aren't and you can't tell from the data, return to `build-ml-pipeline` and ask the user. For symbol-level lookups, defer to `python-api` (skore symbols) and `python-api` (splitters); don't guess names from memory.
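To see why structural metadata such as `groups` matters when picking a cross-validator, here is a simplified pure-Python illustration of the guarantee a group-aware splitter gives: no group appears on both sides of a split. It is not scikit-learn's implementation (its `GroupKFold` also balances fold sizes), and the function name `group_k_fold` is hypothetical.

```python
def group_k_fold(groups: list[str], n_splits: int) -> list[tuple[list[int], list[int]]]:
    """Return (train_indices, test_indices) pairs where no group spans both sides."""
    unique_groups = sorted(set(groups))
    if n_splits < 2 or n_splits > len(unique_groups):
        raise ValueError("n_splits must be between 2 and the number of distinct groups")
    # Assign whole groups (not individual rows) to folds, round-robin.
    fold_of_group = {group: i % n_splits for i, group in enumerate(unique_groups)}
    folds = []
    for fold in range(n_splits):
        test = [i for i, g in enumerate(groups) if fold_of_group[g] == fold]
        train = [i for i, g in enumerate(groups) if fold_of_group[g] != fold]
        folds.append((train, test))
    return folds
```

With a plain row-wise split, rows from the same group could land in both train and test, which tends to inflate scores; a group-aware splitter removes that leak.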
Owns data understanding BEFORE any model is designed. Places and executes `data/eda.py` (a jupytext `# %%` script) via the shared in-process runner, reads the streamed digest, then writes a persisted `data/eda.md` report (plus linked `data/eda_<table>.html` skrub `TableReport` pages) and the `## Data understanding (EDA)` section of `journal/JOURNAL.md`. The point is to surface the dataset facts (shape, dtypes, missingness, cardinality, target balance / skew, datetime / group structure, feature associations) that JUSTIFY the later learner / splitter / metric decisions, so the user understands *why* the modelling choices are made. Uses `skrub.TableReport` for dataframe overviews and the shared runner `audit-ml-pipeline/scripts/run_cells.py`. Stops at "EDA executed, `data/eda.md` + HTML written, JOURNAL EDA section updated." Never designs the model, never edits `src/<pkg>/`, never modifies the user's raw data files. TRIGGER (any of): - `iterate-ml-experiment` § 0 bootstrap, BEFORE the baseline design note; the G-EDA gate fires here (run / skip). - The user asks to "explore the data", "do an EDA", "profile the dataset", "what does the data look like", "understand the data". - A new or changed data source needs (re-)understanding before the next experiment. SKIP when: the workspace isn't scaffolded / bootstrapped yet (`iterate-ml-experiment` § 0 owns bootstrap ordering and will dispatch here at the G-EDA step; don't run standalone ahead of scaffolding; route to `iterate-ml-experiment` / `organize-ml-workspace`); there is no data to explore yet; the user wants to inspect a finished run's skore report rather than the raw dataset (`audit-ml-pipeline`); the user is past data understanding and wants pipeline / evaluation mechanics (`build-ml-pipeline` / `evaluate-ml-pipeline`); a pure symbol lookup (`python-api`); EDA is already recorded (`data/eda.md` + the JOURNAL EDA section exist) and the user is not asking to refresh it.
HOW TO USE: run the Detection step (does `data/eda.md` + the JOURNAL EDA section already exist?), emit the Pre-flight checklist as visible text, read the Stop conditions, then place `data/eda.py` from `templates/eda.py`, execute it via the shared runner, read the digest, and author `data/eda.md` + the JOURNAL EDA section. Always resolve skrub / pandas / polars symbols via `python-api`, never from memory.
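The "execute a `# %%` script, read the digest" loop can be sketched with the standard library alone. This is a stand-in for illustration, not the bundled `run_cells.py` (which, per the audit skill's description, runs cells through IPython's `InteractiveShell.run_cell`); the function names and the digest layout below are assumptions.

```python
import ast
import contextlib
import io
from typing import Optional


def split_cells(source: str) -> list[str]:
    """Split jupytext percent-format source into cell bodies on `# %%` markers."""
    cells, current = [], []
    for line in source.splitlines():
        if line.startswith("# %%"):
            if any(part.strip() for part in current):
                cells.append("\n".join(current))
            current = []
        else:
            current.append(line)
    if any(part.strip() for part in current):
        cells.append("\n".join(current))
    return cells


def run_cell(code: str, namespace: dict) -> tuple[str, Optional[str]]:
    """Execute one cell; return (captured stdout, repr of a trailing bare expression)."""
    tree = ast.parse(code)
    last_expr = None
    if tree.body and isinstance(tree.body[-1], ast.Expr):
        # Peel off the final bare expression so its value can be captured.
        last_expr = ast.Expression(tree.body.pop().value)
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(compile(tree, "<cell>", "exec"), namespace)
        value = eval(compile(last_expr, "<cell>", "eval"), namespace) if last_expr else None
    return buffer.getvalue(), None if value is None else repr(value)


def digest(source: str) -> str:
    """Run every cell in one shared namespace and return a markdown digest."""
    namespace: dict = {}
    sections = []
    for index, code in enumerate(split_cells(source), start=1):
        stdout, result = run_cell(code, namespace)
        lines = [f"## Cell {index}"]
        if stdout:
            lines.append("stdout: " + stdout.rstrip())
        if result is not None:
            lines.append("result: " + result)
        sections.append("\n".join(lines))
    return "\n\n".join(sections)
```

Because a cell's last bare expression is reported through its `repr`, a cell that ends with an object whose `__repr__` is informative contributes a readable entry to the digest without any explicit `print`.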
A set of skills to team up with you in your machine learning experimentation journey.
The aim is to let you focus on the science while AI agents take care of the implementation, guided by two important ingredients: great libraries for maintainability and good methodologies for running experiments correctly.
In practice, from a prompt such as:
╭────────────────────────────────────────────────────────────────────────╮
│ > Given the context in the file `data/README.md` and the data located │
│ in `data/`, let's build a first machine learning pipeline that will │
│ serve as baseline for the next experiments that we are going to run │
│ together. │
╰────────────────────────────────────────────────────────────────────────╯
you can expect your agent to start experimenting with you. The skills work well with models such as Claude Opus and Sonnet, and give good results with smaller models such as Qwen 3.6 30B or DeepSeek v4 Flash. As for agent harnesses, we tested them with Claude Code, OpenCode, Cursor, and GitHub Copilot and did not observe any significant difference in skill invocation.
You can install the skills with the skore CLI, which is available from PyPI or from
conda-forge, by running the following command:
skore skills install
You can also use `uvx` or `pixi exec` to fetch the skore CLI and run the
command directly in an isolated environment:
uvx --from skore-cli skore skills install
or
pixi exec --spec skore-cli skore skills install
If you prefer npx, then you can use:
npx skills add probabl-ai/skills
If you only use Claude Code and prefer the native plugin flow, this repo is also a Claude Code plugin marketplace:
/plugin marketplace add probabl-ai/skills
/plugin install probabl-skills@probabl-skills
`/plugin update` pulls new releases.
| Skill | Description |
|---|---|
| explore-ml-data | Explore the dataset before designing any model. |
| build-ml-pipeline | Build a machine learning pipeline from the data source to the learner, including multi-table engineering. |
| evaluate-ml-pipeline | Evaluate a complex machine learning pipeline and get structured reports including metrics, plots, and diagnostics. |
| test-ml-pipeline | Make sure that your machine learning pipeline is production-ready statistically and functionally. |
| smoke-test-ml-pipeline | Stress test your machine learning pipeline on future data to make sure it works. |
| audit-ml-pipeline | Once testing and the experiment are done, audit the run by loading its skore report and investigating it. |
| Skill | Description |
|---|---|
| iterate-ml-experiment | Design experiments, keep track of them, and iterate on them. |
| iterate-from-skore | Use skore to run diagnostics and checks that can be reported and addressed in the next experiment. |
| iterate-from-user | Stay in the loop and propose new experiments as free text, a scientific article URL, or a resource link (GitHub issue / spec / reference repo). |
| Skill | Description |
|---|---|
| organize-ml-workspace | An organized workspace to keep track of your experiments. |
| python-code-style | Enforce good Python coding practices out of the box. |
| python-env-manager | Bootstrap the experiment setup with your favorite Python environment manager. |
| data-science-python-stack | Opinionated one-library-per-job Python stack, organized into mandatory / user-choice / optional / transitive tiers. |
npx claudepluginhub probabl-ai/skills --plugin probabl-skills