Guides ML experiment logging, versioning, and reproducibility using MLflow, Weights & Biases, and DVC to track parameters, metrics, models, and artifacts in Python projects.
Slash command
/example-skills:ml-experiment-tracker
This skill provides guidance for systematic machine learning experimentation with proper tracking, versioning, and reproducibility practices.
Every experiment should log:
| Category | Items | Why |
|---|---|---|
| Code | Git commit hash, branch, diff | Reproduce exact code state |
| Data | Dataset version, hash, lineage | Know which data was used |
| Environment | Python version, dependencies, hardware | Reproduce runtime |
| Hyperparameters | All config values | Understand what changed |
| Metrics | Loss, accuracy, custom metrics | Compare performance |
| Artifacts | Models, plots, predictions | Preserve outputs |
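The Code, Data, and Environment rows of the table can be captured automatically at the start of a run. The following is a minimal sketch using only the standard library; the helper names (`git_output`, `file_sha256`, `capture_run_context`) are hypothetical and not part of MLflow, W&B, or DVC, and it assumes the dataset is a single local file.

```python
import hashlib
import platform
import subprocess
from pathlib import Path


def git_output(*args: str) -> str:
    """Return the stripped output of a git command, or "unknown" if it fails."""
    try:
        result = subprocess.run(
            ["git", *args], capture_output=True, text=True, check=True
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or the working directory is not a repository
        return "unknown"


def file_sha256(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def capture_run_context(dataset_path: Path) -> dict:
    """Collect code, data, and environment identifiers for one experiment run."""
    return {
        "git_commit": git_output("rev-parse", "HEAD"),
        "git_branch": git_output("rev-parse", "--abbrev-ref", "HEAD"),
        "dataset_sha256": file_sha256(dataset_path),
        "python_version": platform.python_version(),
        "platform": platform.platform(),
    }
```

The resulting dictionary holds only strings, so it can be passed to whichever tracker is in use, for example as run parameters or tags.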
project/
├── experiments/
│   ├── baseline/              # Initial experiments
│   ├── feature-engineering/   # Data improvements
│   ├── architecture/          # Model changes
│   └── hyperparameter/        # Tuning runs
├── data/
│   ├── raw/                   # Original data (versioned)
│   ├── processed/             # Cleaned data
│   └── features/              # Feature store
└── models/
    ├── staging/               # Candidates
    └── production/            # Deployed models
import mlflow
import mlflow.pytorch  # needed for mlflow.pytorch.log_model

# model, train_loader, val_loader, train_epoch, and evaluate are assumed
# to be defined elsewhere in the training script.
epochs = 100

# Set experiment (creates it if it does not exist)
mlflow.set_experiment("my-classification-project")

with mlflow.start_run(run_name="baseline-v1"):
    # Log parameters
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("batch_size", 32)
    mlflow.log_param("epochs", epochs)

    # Training loop
    for epoch in range(epochs):
        train_loss = train_epoch(model, train_loader)
        val_loss, val_acc = evaluate(model, val_loader)

        # Log metrics with step
        mlflow.log_metrics({
            "train_loss": train_loss,
            "val_loss": val_loss,
            "val_accuracy": val_acc
        }, step=epoch)

    # Log model
    mlflow.pytorch.log_model(model, "model")

    # Log artifacts (plots, configs)
    mlflow.log_artifact("confusion_matrix.png")
    mlflow.log_artifact("config.yaml")
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Training   │───▶│   Staging    │───▶│  Production  │
│     Runs     │    │    Review    │    │   Deployed   │
└──────────────┘    └──────────────┘    └──────────────┘
       │                   │                   │
       ▼                   ▼                   ▼
   Candidate           Validated           Monitored
    Models              Models              Models
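The left-to-right flow in the diagram can be read as a small state machine in which a model only ever moves one stage forward, and only after review. The sketch below illustrates that idea with the standard library; `Stage`, `NEXT_STAGE`, and `promote` are hypothetical names and do not correspond to any model registry API.

```python
from enum import Enum


class Stage(Enum):
    TRAINING = "training"
    STAGING = "staging"
    PRODUCTION = "production"


# Allowed forward transitions, taken from the arrows in the diagram
NEXT_STAGE = {
    Stage.TRAINING: Stage.STAGING,
    Stage.STAGING: Stage.PRODUCTION,
}


def promote(current: Stage, approved: bool) -> Stage:
    """Return the stage a model is in after a review decision."""
    if not approved:
        # A rejected model stays where it is
        return current
    if current not in NEXT_STAGE:
        raise ValueError(f"{current.value} has no later stage")
    return NEXT_STAGE[current]
```

Keeping the transitions in one mapping makes it impossible to skip staging by accident, which is the property the diagram implies.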
Stages:
import wandb

# model, train_loader, val_loader, train_and_eval, and pred_grid are assumed
# to be defined elsewhere in the training script.

# Initialize with config
config = {
    "learning_rate": 0.01,
    "architecture": "ResNet50",
    "dataset": "imagenet-subset",
    "epochs": 100
}
run = wandb.init(
    project="image-classification",
    group="architecture-experiments",  # Group related runs
    tags=["baseline", "resnet"],
    config=config,
    notes="Testing ResNet50 baseline on subset"
)

# Training loop: log one metrics dict per epoch
for epoch in range(config["epochs"]):
    metrics = train_and_eval(model, train_loader, val_loader)
    wandb.log(metrics)

# Log media
wandb.log({"predictions": wandb.Image(pred_grid)})
wandb.log({"confusion_matrix": wandb.plot.confusion_matrix(...)})
wandb.finish()
# sweep_config.yaml
program: train.py
method: bayes  # or grid, random
metric:
  name: val_accuracy
  goal: maximize
parameters:
  learning_rate:
    distribution: log_uniform_values
    min: 0.0001
    max: 0.1
  batch_size:
    values: [16, 32, 64, 128]
  optimizer:
    values: ["adam", "sgd", "adamw"]
early_terminate:
  type: hyperband
  min_iter: 10
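The `log_uniform_values` distribution in the sweep config spreads learning rates evenly across orders of magnitude instead of evenly across raw values. The sketch below shows what such a distribution looks like when sampled directly; it is an illustration of the distribution itself, not W&B's implementation, and `sample_log_uniform` is a hypothetical helper. With `method: bayes` the sweep would not draw independent samples like this.

```python
import math
import random


def sample_log_uniform(low: float, high: float, rng: random.Random) -> float:
    """Draw x between low and high such that log(x) is uniformly distributed."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_log_uniform(1e-4, 1e-1, rng) for _ in range(10_000)]

# The range 1e-4..1e-1 spans three decades, so each decade should receive
# about one third of the samples.
share_below_1e_3 = sum(s < 1e-3 for s in samples) / len(samples)
print(f"share of samples below 1e-3: {share_below_1e_3:.2f}")
```

A plain uniform distribution over the same range would put fewer than 1% of samples below 1e-3, which is why log-uniform is the usual choice for learning rates.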
# Initialize DVC in git repo
dvc init
# Track large files
dvc add data/training.csv
git add data/training.csv.dvc data/.gitignore
git commit -m "Add training data v1"
# Push to remote storage
dvc remote add -d storage s3://bucket/dvc
dvc push
# Create a pipeline stage (dvc stage add replaces dvc run, which was removed in DVC 3.0)
dvc stage add -n preprocess \
    -d src/preprocess.py -d data/raw \
    -o data/processed \
    python src/preprocess.py
# Run or reproduce the pipeline
dvc repro
# dvc.yaml
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - src/preprocess.py
      - data/raw/
    outs:
      - data/processed/
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed/
    params:
      - train.epochs
      - train.learning_rate
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false
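The `train` stage declares `metrics.json` as a metrics file, so the training script is expected to write it. A minimal sketch of that step is shown below, assuming the metrics are a flat mapping of names to numbers; `write_metrics` is a hypothetical helper, not a DVC function.

```python
import json
from pathlib import Path


def write_metrics(metrics: dict[str, float], path: Path = Path("metrics.json")) -> None:
    """Write final metrics as JSON so they can be versioned and compared."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Sorted keys keep the file stable between runs, which keeps diffs small
    path.write_text(json.dumps(metrics, indent=2, sort_keys=True) + "\n")
```

At the end of `src/train.py`, a call such as `write_metrics({"val_loss": val_loss, "val_accuracy": val_acc})` would produce the file. The `train.epochs` and `train.learning_rate` entries under `params` refer to keys in `params.yaml`, which is DVC's default parameters file.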
experiment: {project}-{objective}
run: {date}-{description}-{variant}
model: {architecture}-{dataset}-{version}
Examples:
experiment: fraud-detection-baseline
run: 2024-01-15-xgboost-tuning-lr001
model: xgboost-transactions-v2.3.1
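Names are easier to keep consistent when they are built by code instead of typed by hand. The helpers below are a small sketch of the three patterns above; the function names are hypothetical.

```python
from datetime import date


def experiment_name(project: str, objective: str) -> str:
    """Build {project}-{objective}."""
    return f"{project}-{objective}"


def run_name(day: date, description: str, variant: str) -> str:
    """Build {date}-{description}-{variant} with an ISO 8601 date."""
    return f"{day.isoformat()}-{description}-{variant}"


def model_name(architecture: str, dataset: str, version: str) -> str:
    """Build {architecture}-{dataset}-{version}."""
    return f"{architecture}-{dataset}-{version}"


print(experiment_name("fraud-detection", "baseline"))
# fraud-detection-baseline
print(run_name(date(2024, 1, 15), "xgboost-tuning", "lr001"))
# 2024-01-15-xgboost-tuning-lr001
print(model_name("xgboost", "transactions", "v2.3.1"))
# xgboost-transactions-v2.3.1
```

Using an ISO date prefix means run names sort chronologically in any listing.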
Track these metrics for model comparison:
Each significant experiment should document:
- references/mlflow-setup.md - MLflow installation and configuration
- references/wandb-patterns.md - Advanced W&B features and sweeps
- references/reproducibility-checklist.md - Detailed reproducibility guide