From skillry-data-ml-ai-engineering
Use when you need to review an ML training pipeline for reproducibility, data/train/val/test splitting, data leakage, checkpointing, and experiment tracking (MLflow/W&B).
Slash command
/skillry-data-ml-ai-engineering:313-ml-training-pipeline-review
Review a model training pipeline so that runs are reproducible, free of data leakage, properly checkpointed, and tracked. Cover seeding and determinism, dataset splitting and the train/validation/test boundary, leakage from preprocessing or target encoding, feature/label alignment, checkpointing and resumption, and experiment tracking with MLflow or Weights & Biases. The goal is that a given commit plus a given dataset version reproduces the same metrics, and that reported scores are honest (no information from the test set leaked into training).
find . -name "train*.py" -o -name "*trainer*.py" -o -name "*.yaml" -path "*conf*" 2>/dev/null | grep -v __pycache__ | head -30
grep -rn "fit(\|\.train()\|trainer\.\|Trainer(\|optimizer" . --include="*.py" | head -20
grep -rn "seed\|random_state\|manual_seed\|set_seed\|deterministic\|np.random\|torch.use_deterministic" . --include="*.py" | head -25
Confirm seeds are set for Python random, NumPy, and the framework (PyTorch/TF), and that random_state is passed to every split and shuffle.
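A minimal sketch of what "a seed passed to every split and shuffle" buys, using only the standard library. The helper name `seeded_split` and its parameters are hypothetical and not taken from any pipeline under review; the idea is that a local, explicitly seeded generator makes the split depend only on its inputs rather than on global random state.

```python
import random
from typing import List, Sequence, Tuple


def seeded_split(
    rows: Sequence[int], test_fraction: float, seed: int
) -> Tuple[List[int], List[int]]:
    """Shuffle with a local, explicitly seeded RNG and split off a test set."""
    rng = random.Random(seed)   # local generator: unaffected by global state
    shuffled = list(rows)       # copy so the caller's data is not mutated
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))
train_a, test_a = seeded_split(rows, test_fraction=0.2, seed=42)
train_b, test_b = seeded_split(rows, test_fraction=0.2, seed=42)
```

Two calls with the same seed return identical partitions, which is the property to look for when reviewing each split or shuffle call site.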
# Splitting
grep -rn "train_test_split\|StratifiedKFold\|TimeSeriesSplit\|GroupKFold\|\.split(" . --include="*.py" | head -20
# Preprocessing fit location (must fit on train only)
grep -rn "\.fit(\|fit_transform\|StandardScaler\|fit_resample\|SMOTE\|TargetEncoder\|impute" . --include="*.py" | head -25
The split must happen before any fit/fit_transform. Scalers, imputers, encoders, and resamplers must be fit on the training fold only and applied to validation/test — never fit_transform on the full dataset.
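The fit-on-train, apply-to-test rule can be sketched for target encoding with the standard library alone. The function names and toy data below are made up for illustration; a real pipeline would more likely use a library encoder, but the boundary is the same: the mapping and the fallback value are both computed from training rows only.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def fit_target_means(
    categories: Sequence[str], targets: Sequence[float]
) -> Dict[str, float]:
    """Learn per-category target means from TRAINING rows only."""
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    return {cat: sums[cat] / counts[cat] for cat in sums}


def apply_target_means(
    categories: Sequence[str], mapping: Dict[str, float], fallback: float
) -> List[float]:
    """Encode rows with the training-time mapping; unseen categories get the fallback."""
    return [mapping.get(cat, fallback) for cat in categories]


# Toy data, invented for this example.
train_cat, train_y = ["a", "a", "b", "b"], [1.0, 0.0, 1.0, 1.0]
test_cat = ["a", "b", "c"]  # "c" never appears in training

mapping = fit_target_means(train_cat, train_y)   # fit: training fold only
global_mean = sum(train_y) / len(train_y)        # fallback also from training
test_encoded = apply_target_means(test_cat, mapping, global_mean)
```

Test labels never enter `fit_target_means`, so the test score cannot benefit from them. Within the training fold itself, out-of-fold encoding is a common further safeguard that this sketch does not show.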
For time-series or any data with a time dimension, confirm the split is chronological (no future rows in training) and that features do not include information unavailable at prediction time (target leakage, post-event columns).
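A chronological split can be sketched as follows, assuming each row carries an event date. The `Row` shape, the `chronological_split` helper, and the sample dates are hypothetical; the guard at the end is the kind of check a reviewer might ask to have added to the pipeline.

```python
from datetime import date
from typing import List, Tuple

Row = Tuple[date, float]  # (event date, feature value); hypothetical row shape


def chronological_split(rows: List[Row], cutoff: date) -> Tuple[List[Row], List[Row]]:
    """Rows strictly before `cutoff` go to training; the rest go to test."""
    ordered = sorted(rows, key=lambda r: r[0])
    train = [r for r in ordered if r[0] < cutoff]
    test = [r for r in ordered if r[0] >= cutoff]
    return train, test


rows = [
    (date(2024, 3, 1), 0.7),
    (date(2024, 1, 1), 0.1),
    (date(2024, 4, 1), 0.9),
    (date(2024, 2, 1), 0.4),
]
train, test = chronological_split(rows, cutoff=date(2024, 3, 1))

# Guard: no training row is as new as, or newer than, any test row.
assert max(r[0] for r in train) < min(r[0] for r in test)
```

A date-based cutoff only addresses row ordering. Columns populated after the event still need to be checked separately.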
grep -rn "checkpoint\|save_model\|state_dict\|ModelCheckpoint\|resume\|load_state" . --include="*.py" | head -20
grep -rn "mlflow\|wandb\|log_param\|log_metric\|log_artifact\|set_experiment" . --include="*.py" | head -25
Confirm checkpoints save model + optimizer + epoch so a run can resume, and that params, metrics, dataset version, and code commit are logged per run.
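Resumption also needs a way to pick the checkpoint to restart from. The sketch below assumes checkpoints are written as `ckpt_<epoch>.pt` in a single directory; the helper name `find_resume_point` is hypothetical. It compares epochs numerically, because sorting the file names as strings would rank `ckpt_9.pt` above `ckpt_10.pt`.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional, Tuple

CKPT_PATTERN = re.compile(r"^ckpt_(\d+)\.pt$")  # assumes files named ckpt_<epoch>.pt


def find_resume_point(ckpt_dir: Path) -> Tuple[Optional[Path], int]:
    """Return (latest checkpoint path or None, epoch to start from)."""
    latest_path: Optional[Path] = None
    latest_epoch = -1
    for path in ckpt_dir.glob("ckpt_*.pt"):
        match = CKPT_PATTERN.match(path.name)
        if match and int(match.group(1)) > latest_epoch:
            latest_epoch = int(match.group(1))
            latest_path = path
    # Epoch N finished, so training resumes at N + 1 (0 when nothing is saved).
    return latest_path, latest_epoch + 1


with tempfile.TemporaryDirectory() as tmp:
    ckpt_dir = Path(tmp)
    empty_result = find_resume_point(ckpt_dir)       # nothing saved yet
    for epoch in (2, 9, 10):
        (ckpt_dir / f"ckpt_{epoch}.pt").touch()      # empty stand-ins for real files
    latest, start_epoch = find_resume_point(ckpt_dir)
    latest_name = latest.name
```

In a PyTorch pipeline the returned path would then be read with `torch.load`, and the model and optimizer restored with `load_state_dict` before the loop starts at the returned epoch.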
random_state is passed to every split/shuffle.
The split happens before any .fit.
No fit_transform on the full dataset prior to splitting.
Time-ordered data uses a chronological split (e.g. TimeSeriesSplit), not random shuffling.
Full reproducibility seeding (Python / PyTorch):
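import os
import random
import numpy as np
import torch

def set_global_seed(seed: int = 42) -> None:
    # PYTHONHASHSEED is read at interpreter startup: setting it here only
    # affects subprocesses, so export it before launching Python as well.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # warn_only=True warns instead of raising when an op has no deterministic version.
    torch.use_deterministic_algorithms(True, warn_only=True)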
Leakage-safe split and preprocessing with a pipeline (scikit-learn):
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# 1) Split FIRST so the scaler never sees test rows.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# 2) Fit the scaler INSIDE a pipeline on the training fold only.
pipe = Pipeline([
    ("scaler", StandardScaler()),  # fit on X_tr during pipe.fit
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])
pipe.fit(X_tr, y_tr)  # no leakage from X_te
print("test acc:", pipe.score(X_te, y_te))  # test touched once
Experiment tracking + resumable checkpoint (MLflow + PyTorch):
import mlflow
import torch

# lr, epochs, start_epoch, git_sha, dataset_hash, model, optimizer, the loaders,
# and train_one_epoch/evaluate are assumed to be defined elsewhere.
mlflow.set_experiment("churn-model")
with mlflow.start_run():
    mlflow.log_params({"lr": lr, "epochs": epochs, "seed": 42})
    mlflow.set_tag("git_commit", git_sha)
    mlflow.set_tag("dataset_version", dataset_hash)  # tie run to data
    ckpt_path = None
    for epoch in range(start_epoch, epochs):
        train_one_epoch(model, optimizer, loader)
        val = evaluate(model, val_loader)
        mlflow.log_metric("val_auc", val["auc"], step=epoch)
        ckpt_path = f"ckpt_{epoch}.pt"
        torch.save(  # resumable checkpoint
            {"epoch": epoch,
             "model": model.state_dict(),
             "optimizer": optimizer.state_dict()},
            ckpt_path)
    if ckpt_path is not None:
        mlflow.log_artifact(ckpt_path)  # log the last checkpoint actually written
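The MLflow snippet uses `git_sha` and `dataset_hash` without showing where they come from. One way to derive them with the standard library is sketched here; the helper names, the `train.csv` file, and the "unknown" fallback are assumptions for illustration, not part of the skill.

```python
import hashlib
import subprocess
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a dataset file's bytes so a run can be tied to the exact data."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def current_git_sha(repo_dir: Path = Path(".")) -> str:
    """Return the checked-out commit, or "unknown" if git cannot report one."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return "unknown"
    return out.stdout.strip()


with tempfile.TemporaryDirectory() as tmp:
    data_path = Path(tmp) / "train.csv"           # hypothetical dataset file
    data_path.write_bytes(b"id,label\n1,0\n2,1\n")
    dataset_hash = file_sha256(data_path)

git_sha = current_git_sha()
```

A commit hash describes the code only when the working tree is clean, so a reviewer may also want the pipeline to record whether `git status --porcelain` printed anything.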
scaler.fit_transform(X) before splitting: test statistics leak into training.
Produce a structured report with:
Findings, each cited by file:line.
One row per issue: file:line | issue | impact on metric honesty | fix.
npx claudepluginhub fluxonlab/skillry --plugin skillry-data-ml-ai-engineering