From ml-research
Systematic experiment tracking, comparison, and analysis for machine learning research.
Slash command: /ml-research:ml-experiment
Directory Structure:
logs/
├── 2026-02-22/
│ ├── 14-30-22/ # Timestamp of run
│ │ ├── .hydra/
│ │ │ ├── config.yaml # Full resolved config
│ │ │ ├── overrides.yaml # CLI overrides
│ │ │ └── hydra.yaml
│ │ ├── checkpoints/
│ │ ├── metrics.csv
│ │ └── train.log
│ └── 15-45-10/
└── experiment_registry.json # Central registry
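The layout above can be walked programmatically. The following is a minimal sketch, assuming every run directory sits two levels below `logs/` (date, then time) and contains `.hydra/config.yaml`. The helper name `find_runs` is hypothetical and is not one of the plugin's scripts.

```python
from pathlib import Path


def find_runs(log_root):
    """Return run directories shaped like <log_root>/<date>/<time>/ that hold a Hydra config."""
    root = Path(log_root)
    runs = []
    # Two glob levels: the date directory, then the timestamp directory.
    for run_dir in sorted(root.glob("*/*")):
        # Only directories with a resolved Hydra config count as runs.
        if (run_dir / ".hydra" / "config.yaml").is_file():
            runs.append(run_dir)
    return runs
```

Because the date and time directory names are zero-padded, the sorted result is also in chronological order.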
Interactive Setup - Ask User:
Generate: configs/experiment/<name>.yaml
# @package _global_
# Metadata
name: "vit_imagenet_finetuning"
description: "Fine-tune Vision Transformer on ImageNet subset"
tags: ["vision-transformer", "transfer-learning", "imagenet"]
# Compose from existing configs
defaults:
  - override /model: vit_base
  - override /data: imagenet
  - override /trainer: gpu_multi
  - override /logger: wandb
# Seed
seed: 42
# Model overrides
model:
  pretrained: true
  freeze_backbone: false
  num_classes: 1000
  optimizer:
    lr: 0.001
# Data overrides
data:
  batch_size: 256
  num_workers: 8
  image_size: 224
# Trainer overrides
trainer:
  max_epochs: 50
  precision: "16-mixed"
  devices: 4
  strategy: "ddp"
# Callbacks
callbacks:
  model_checkpoint:
    monitor: "val/acc"
    mode: "max"
    save_top_k: 3
  early_stopping:
    monitor: "val/loss"
    patience: 10
    mode: "min"
# Logger
logger:
  wandb:
    project: "imagenet-classification"
    tags: ${tags}
    notes: ${description}
Run experiment:
python src/train.py experiment=vit_imagenet_finetuning
See templates/experiment-templates.yaml for common experiment types.
# In LightningModule
def on_train_end(self):
    # Log experiment to registry
    from scripts.experiment_registry import log_experiment

    log_experiment(
        name=self.hparams.experiment_name,
        config_path=self.hparams.config_path,
        metrics={
            # ModelCheckpoint monitors "val/acc" (see callbacks config), so its best score is the best accuracy
            "best_val_acc": self.trainer.checkpoint_callback.best_model_score.item(),
            # EarlyStopping monitors "val/loss" (see callbacks config), so its best score is the best loss
            "best_val_loss": self.trainer.early_stopping_callback.best_score.item(),
            "epochs_trained": self.trainer.current_epoch,
        },
        hyperparameters={
            "lr": self.hparams.optimizer.lr,
            "batch_size": self.hparams.data.batch_size,
            "optimizer": self.hparams.optimizer._target_,
        },
        tags=self.hparams.tags,
    )
logs/experiment_registry.json:
{
  "experiments": [
    {
      "id": "exp_001",
      "name": "baseline_resnet50",
      "timestamp": "2026-02-22T14:30:22",
      "config": "configs/experiment/baseline.yaml",
      "status": "completed",
      "metrics": {
        "best_val_acc": 0.876,
        "best_val_loss": 0.324,
        "final_train_loss": 0.145,
        "epochs_trained": 45
      },
      "hyperparameters": {
        "lr": 0.001,
        "batch_size": 128,
        "optimizer": "AdamW"
      },
      "runtime": "2h 34m",
      "gpu_count": 2,
      "tags": ["baseline", "resnet"]
    }
  ]
}
See scripts/experiment_registry.py for implementation.
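The contents of `scripts/experiment_registry.py` are not shown here, so the following is only a sketch of how a `log_experiment` function with the call signature used above might append a record to the registry. The `registry_path` parameter, the sequential id scheme, and the fixed `"completed"` status are assumptions made for illustration.

```python
import json
from datetime import datetime
from pathlib import Path


def log_experiment(name, config_path, metrics, hyperparameters, tags,
                   registry_path="logs/experiment_registry.json"):
    """Append one experiment record to the JSON registry and return its id."""
    path = Path(registry_path)
    # Start a fresh registry if the file does not exist yet.
    if path.exists():
        registry = json.loads(path.read_text())
    else:
        registry = {"experiments": []}

    # Assumed id scheme: sequential ids in the exp_001 style.
    exp_id = f"exp_{len(registry['experiments']) + 1:03d}"
    registry["experiments"].append({
        "id": exp_id,
        "name": name,
        "timestamp": datetime.now().isoformat(timespec="seconds"),
        "config": config_path,
        "status": "completed",
        "metrics": metrics,
        "hyperparameters": hyperparameters,
        "tags": list(tags),
    })

    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(registry, indent=2))
    return exp_id
```

This read-modify-write cycle is not safe when several processes write at once. With a multi-GPU strategy such as DDP, one option is to call it only where `self.trainer.is_global_zero` is true.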
Compare specific experiments:
python scripts/compare_experiments.py exp_001 exp_002 exp_003
Output:
ID       Name               Val Acc  Val Loss  LR     Batch  Runtime
exp_001  baseline_resnet50  0.876    0.324     0.001  128    2h 34m
exp_002  resnet50_tuned     0.892    0.298     0.005  256    3h 12m
exp_003  resnet50_dropout   0.884    0.312     0.001  128    2h 45m
Comparison plot:
# Generates logs/experiment_comparison.png
# - Bar charts for accuracy and loss
# - Side-by-side comparison
See scripts/compare_experiments.py for full implementation.
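To illustrate how a comparison like the output above can be derived from the registry, here is a hedged sketch. It assumes the registry JSON schema shown earlier and is not the actual `scripts/compare_experiments.py`. The function name and column widths are illustrative, and the runtime column and plotting are omitted.

```python
import json


def compare_experiments(registry_path, exp_ids):
    """Return a plain-text table comparing the selected experiments."""
    with open(registry_path) as f:
        experiments = {e["id"]: e for e in json.load(f)["experiments"]}

    header = f"{'ID':<8} {'Name':<20} {'Val Acc':>8} {'Val Loss':>9} {'LR':>8} {'Batch':>6}"
    lines = [header]
    for exp_id in exp_ids:
        e = experiments[exp_id]  # an unknown id raises KeyError on purpose
        m, hp = e["metrics"], e["hyperparameters"]
        lines.append(
            f"{e['id']:<8} {e['name']:<20} {m['best_val_acc']:>8.3f} "
            f"{m['best_val_loss']:>9.3f} {hp['lr']:>8g} {hp['batch_size']:>6d}"
        )
    return "\n".join(lines)
```

Rows follow the order of `exp_ids`, so the caller decides whether to sort by accuracy, loss, or run order.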
# configs/experiment/baseline.yaml
name: "baseline"
description: "Baseline with default hyperparameters"
tags: ["baseline"]
# Use defaults from model/data/trainer
model: {}
data: {}
trainer:
  max_epochs: 100
# configs/experiment/ablation_dropout.yaml
name: "ablation_dropout"
description: "Effect of dropout rate"
tags: ["ablation", "regularization"]
# Run with: --multirun model.dropout=0.0,0.1,0.2,0.3,0.4,0.5
model:
  dropout: 0.3
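A comma-separated `--multirun` sweep such as the one in the comment above launches one job per value, and Hydra's basic sweeper takes the Cartesian product when several keys are swept. The sketch below only mimics that expansion to show which jobs result. `expand_sweep` is a hypothetical helper, not a Hydra API.

```python
from itertools import product


def expand_sweep(sweep):
    """Expand {'key': [values...]} into one list of Hydra-style overrides per job."""
    keys = list(sweep)
    jobs = []
    # One job per combination of values, in the order the keys were given.
    for combo in product(*(sweep[k] for k in keys)):
        jobs.append([f"{k}={v}" for k, v in zip(keys, combo)])
    return jobs
```

For the dropout ablation, `expand_sweep({"model.dropout": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]})` yields six single-override jobs.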
# configs/experiment/hp_optimization.yaml
name: "hp_optimization"
description: "Hyperparameter optimization with Optuna"
tags: ["optimization", "tuning"]
defaults:
  - override hydra/sweeper: optuna
hydra:
  sweeper:
    n_trials: 100
    direction: maximize
    study_name: "model_optimization"
    params:
      model.hidden_dims:
        type: categorical
        choices: [[512,256], [1024,512,256]]
      model.optimizer.lr:
        type: float
        low: 0.0001
        high: 0.01
        log: true
optimized_metric: "val/acc"
See templates/ for more experiment types.
# Save package versions
pixi list > logs/exp_001/environment.txt
# or
uv pip freeze > logs/exp_001/requirements.txt
# Save git commit
git rev-parse HEAD > logs/exp_001/commit_hash.txt
# Save system info
python -c "import torch; print(f'PyTorch: {torch.__version__}\nCUDA: {torch.version.cuda}')" > logs/exp_001/system_info.txt
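The same kind of snapshot can be captured from inside the training script. This is a minimal sketch using only the standard library. The function name, the `environment.json` file name, and the default package list are assumptions. It complements rather than replaces the shell commands above.

```python
import json
import platform
import sys
from importlib import metadata
from pathlib import Path


def save_environment_snapshot(run_dir, packages=("torch", "numpy")):
    """Write interpreter, OS, and selected package versions to <run_dir>/environment.json."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            versions[pkg] = None  # package is not installed in this environment
    snapshot = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }
    out = Path(run_dir) / "environment.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(snapshot, indent=2))
    return out
```

Recording `None` for a missing package keeps the snapshot explicit about what was absent instead of failing the run.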
# Checkout exact code
git checkout $(cat logs/exp_001/commit_hash.txt)
# Restore environment
pixi install
# or
uv pip install -r logs/exp_001/requirements.txt
# Run with exact config
python src/train.py \
--config-path ../logs/exp_001/.hydra \
--config-name config
Reproducibility Checklist:
python scripts/analyze_experiment.py logs/2026-02-22/14-30-22/
Generates:
analysis.png - Training curves (loss, accuracy, LR)
Example:
Experiment Summary:
Best Val Acc: 0.8921
Best Val Loss: 0.2984
Epochs Trained: 45
Final LR: 0.000123
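A summary like the one above can be computed directly from `metrics.csv`. The sketch below assumes the CSV has an `epoch` column (zero-indexed) plus `val/acc` and `val/loss` columns, with empty cells on rows where a metric was not logged. These column names are assumptions, and this is not the actual `scripts/analyze_experiment.py`.

```python
import csv


def summarize_metrics(csv_path, acc_key="val/acc", loss_key="val/loss"):
    """Compute best validation accuracy/loss and the number of epochs from a metrics CSV."""
    accs, losses, epochs = [], [], []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            # Skip empty cells: not every row carries every metric.
            if row.get(acc_key):
                accs.append(float(row[acc_key]))
            if row.get(loss_key):
                losses.append(float(row[loss_key]))
            if row.get("epoch"):
                epochs.append(int(float(row["epoch"])))
    return {
        "best_val_acc": max(accs) if accs else None,
        "best_val_loss": min(losses) if losses else None,
        # Assumes epochs are zero-indexed, so the count is the last index plus one.
        "epochs_trained": max(epochs) + 1 if epochs else 0,
    }
```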
# List all experiments
python scripts/list_experiments.py
# Filter by tags
python scripts/list_experiments.py --tags baseline ablation
# Export to CSV
python scripts/export_results.py --output results.csv
# Generate markdown report
python scripts/generate_report.py --format markdown --output report.md
See examples/experiment-analysis.md for detailed analysis workflows.
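Tag filtering over the registry entries can be sketched as follows. Whether `--tags baseline ablation` means "any of" or "all of" is not stated, so this hypothetical helper defaults to any-match and exposes a `match_all` switch. It assumes each entry carries a `tags` list as in the registry example.

```python
def filter_by_tags(experiments, tags, match_all=False):
    """Return the experiments whose tags overlap (or fully contain) the requested tags."""
    wanted = set(tags)
    selected = []
    for exp in experiments:
        have = set(exp.get("tags", []))
        # match_all: every requested tag must be present; otherwise one shared tag is enough.
        hit = wanted <= have if match_all else bool(wanted & have)
        if hit:
            selected.append(exp)
    return selected
```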
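import wandb
api = wandb.Api()
runs = api.runs("my-project")
# Filter runs
runs = api.runs("my-project", filters={"tags": "baseline"})
# Get metrics (skip runs that never logged val/acc)
for run in runs:
    val_acc = run.summary.get("val/acc")
    if val_acc is not None:
        print(f"{run.name}: val_acc={val_acc:.4f}")
# Download artifacts from the run with the highest val/acc
best_run = max(runs, key=lambda r: r.summary.get("val/acc", float("-inf")))
best_run.file("model.pt").download()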
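# Open workspace: done in the W&B web UI (the wandb CLI has no `workspace` subcommand)
# Generate report: created in the web UI or through the Python Reports API (the wandb CLI has no `reports create` subcommand)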
# Initialize sweep
wandb sweep configs/sweep/bayesian_optimization.yaml
# Run sweep agent
wandb agent <sweep-id>
See examples/wandb-integration.md for complete guide.
Naming examples: vit_large_imagenet_pretrained, exp_2026_02_baseline
Prefixes: ablation_, optimization_, baseline_
configs/experiment/
├── baselines/
│ ├── resnet_baseline.yaml
│ └── vit_baseline.yaml
├── ablations/
│ ├── ablation_dropout.yaml
│ └── ablation_lr.yaml
└── optimizations/
└── hp_optimization.yaml
Purpose: Establish reference performance.
name: "baseline"
tags: ["baseline"]
model: {} # Use defaults
Purpose: Isolate effect of single component.
name: "ablation_batch_norm"
tags: ["ablation"]
model:
  use_batch_norm: false # Remove batch norm
Purpose: Find optimal hyperparameters.
name: "hp_tuning"
tags: ["optimization"]
# Use with --multirun or Optuna sweeper
Purpose: Fine-tune pretrained model.
name: "transfer_learning"
tags: ["transfer-learning"]
model:
  pretrained: true
  freeze_backbone: true # Freeze early layers
Purpose: Compare different architectures.
# Run multiple architectures
python src/train.py --multirun \
experiment=architecture_search \
model=resnet18,resnet50,vit_base
# Create new experiment
python src/train.py experiment=<name>
# List experiments
python scripts/list_experiments.py
# Compare experiments
python scripts/compare_experiments.py exp_001 exp_002 exp_003
# Analyze experiment
python scripts/analyze_experiment.py logs/2026-02-22/14-30-22/
# Clean old experiments (keep best 5)
python scripts/clean_experiments.py --keep-best 5
# Export results
python scripts/export_results.py --output results.csv
# Generate report
python scripts/generate_report.py --format markdown --output report.md
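How `clean_experiments.py --keep-best 5` ranks runs is not specified. One plausible reading, sketched here as an assumption, ranks registry entries by `best_val_acc` and marks everything outside the top N for removal. The function name and parameters are hypothetical.

```python
def select_for_cleanup(experiments, keep_best=5, metric="best_val_acc", higher_is_better=True):
    """Split experiments into (keep, remove) by ranking on a registry metric."""
    ranked = sorted(
        experiments,
        key=lambda e: e["metrics"][metric],
        reverse=higher_is_better,  # best first
    )
    return ranked[:keep_best], ranked[keep_best:]
```

Returning the removal list instead of deleting anything allows a dry run before checkpoints are actually removed.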
Experiment registry not updating:
- Check logs/experiment_registry.json
- Verify the on_train_end callback is called
Can't reproduce results:
W&B runs not logging:
- Check that WANDB_API_KEY is set
- Run wandb login again
Metrics not saving:
- Check that log_every_n_steps is set
Experiments are well-organized and easily comparable!
npx claudepluginhub nishide-dev/claude-code-ml-research --plugin ml-research