From harnessml
Use before creating any experiment. This is the thinking step. If you skip it, you'll run experiments that don't teach you anything.
Slash command: /harnessml:experiment-design
Before you create an experiment, answer these five questions. If you can't answer all five, you're not ready.
Not "try lower learning rate." A question:
The question determines what you look at in the results. Without it, you'll just check whether the aggregate metric went up or down — and learn nothing.
One change per experiment. If you change learning_rate AND add features AND switch calibration, you can't attribute the result to any one of them.
Exception: mechanically coupled changes. Lowering learning_rate usually requires raising n_estimators to compensate, so the pair counts as one logical change, not two. Document the coupling in the hypothesis.
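The one-change rule and its coupling exception can be checked mechanically before a run. The following is a minimal sketch, assuming experiment configs are plain dicts; `COUPLED_GROUPS` and `logical_changes` are hypothetical names for illustration, not part of harnessml.

```python
# Hypothetical coupling table: parameters that must move together
# count as one logical change (an assumption for this sketch).
COUPLED_GROUPS = [frozenset({"learning_rate", "n_estimators"})]


def logical_changes(baseline: dict, candidate: dict) -> list:
    """Return the changed parameters, grouped into logical changes."""
    keys = set(baseline) | set(candidate)
    changed = {k for k in keys if baseline.get(k) != candidate.get(k)}
    groups = []
    for coupled in COUPLED_GROUPS:
        hit = changed & coupled
        if hit:
            # Coupled parameters collapse into a single logical change.
            groups.append(set(hit))
            changed -= hit
    # Every remaining parameter is its own logical change.
    groups.extend({k} for k in sorted(changed))
    return groups


# Illustrative configs, not taken from a real project.
baseline = {"learning_rate": 0.1, "n_estimators": 300, "reg_lambda": 0.0}
coupled_only = {"learning_rate": 0.05, "n_estimators": 600, "reg_lambda": 0.0}
confounded = {"learning_rate": 0.05, "n_estimators": 600, "reg_lambda": 0.5}

print(logical_changes(baseline, coupled_only))  # one group: the coupled pair
print(logical_changes(baseline, confounded))    # two groups: not attributable
```

If the function returns more than one group, split the candidate into separate experiments.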
Be specific. Not "the metric improves." Think about:
This is what makes the result interpretable. If you predicted fold 3 would improve and it did, that's confirmation. If folds you didn't expect to change moved instead, that's a different signal entirely.
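One way to make that comparison explicit is to write the prediction down as per-fold ranges and score the result against it. This is a sketch under assumptions: the function name, the tolerance for "neutral," and all numbers are made up for illustration.

```python
def compare_to_prediction(predicted, observed, tolerance=0.002):
    """Classify each fold by whether its observed change was predicted.

    predicted: fold -> (low, high) expected improvement; folds not
               listed are expected to stay neutral.
    observed:  fold -> measured improvement (positive means better).
    """
    report = {}
    for fold, delta in observed.items():
        if fold in predicted:
            low, high = predicted[fold]
            report[fold] = "confirmed" if low <= delta <= high else "missed"
        else:
            # A fold we expected to be neutral moved: a different signal.
            report[fold] = "neutral" if abs(delta) <= tolerance else "unexpected"
    return report


# Illustrative numbers only.
predicted = {2: (0.005, 0.01), 6: (0.005, 0.01)}
observed = {1: 0.0005, 2: 0.007, 3: -0.006, 6: 0.001}

for fold, status in sorted(compare_to_prediction(predicted, observed).items()):
    print(fold, status)
```

Folds marked "unexpected" are the ones to inspect first, since they moved without a stated mechanism.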
Equally important. If the metric gets worse:
Write it in this structure:
Changing [single variable] because [mechanism/reasoning] expecting [specific predicted outcome with magnitude] which would mean [what it tells us about the model/data].
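The template can also be held as a small record so that no slot is left empty. This is a minimal sketch; the `Hypothesis` class and its field names are hypothetical and not a harnessml API.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Hypothesis:
    change: str     # the single variable being changed
    because: str    # mechanism or reasoning
    expecting: str  # specific predicted outcome, with magnitude
    meaning: str    # what the outcome says about the model or data

    def __post_init__(self):
        # Refuse a hypothesis with any slot left blank.
        for f in fields(self):
            if not getattr(self, f.name).strip():
                raise ValueError(f"hypothesis is missing '{f.name}'")

    def render(self) -> str:
        return (
            f"Changing {self.change} because {self.because}, "
            f"expecting {self.expecting}, which would mean {self.meaning}."
        )


h = Hypothesis(
    change="reg_lambda from 0 to 0.5 on xgb_main",
    because="folds 2 and 6 show a >0.02 train-test Brier gap",
    expecting="folds 2 and 6 to improve by 0.005-0.01, other folds neutral",
    meaning="uniform regularization is sufficient to address the overfitting",
)
print(h.render())
```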
Example:
Changing reg_lambda from 0 to 0.5 on xgb_main because folds 2 and 6 show >0.02 train-test Brier gap, suggesting the model memorizes low-sample folds. Expecting folds 2 and 6 to improve by 0.005-0.01, other folds neutral, overall Brier improves 0.003-0.005. Which would mean the ensemble is currently overfitting on edge cases and uniform regularization is sufficient to address it.
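The evidence cited in this example, a per-fold train-test Brier gap above 0.02, can be computed along these lines. This is a sketch assuming binary labels and predicted probabilities; the helper names and the per-fold scores are illustrative, not output from a real run.

```python
def brier(probs, labels):
    """Brier score: mean squared error between probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def overfit_folds(fold_scores, threshold=0.02):
    """Return folds whose test Brier exceeds train Brier by more than threshold.

    fold_scores: fold -> (train_brier, test_brier)
    """
    return sorted(
        fold for fold, (train, test) in fold_scores.items()
        if test - train > threshold
    )


# Tiny check of the score itself.
print(brier([0.5, 0.5], [1, 0]))

# Made-up per-fold scores for illustration.
fold_scores = {1: (0.180, 0.185), 2: (0.150, 0.178), 6: (0.160, 0.190)}
print(overfit_folds(fold_scores))
```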
Before your first attempt, sketch a rough plan for the strategy:
This prevents the most common failure: trying one config, seeing it fail, and abandoning the entire strategy. You committed to investigating the question, not to one parameterization of it.
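Committing to the question rather than to one configuration can be enforced with a small plan object. The sketch below is an assumption about how such a plan might be tracked; `StrategyPlan` and the rule that a strategy may be dropped only after every planned attempt has failed are illustrative choices, not harnessml behavior.

```python
from dataclasses import dataclass, field


@dataclass
class StrategyPlan:
    question: str
    planned: list  # parameterizations to try, in order
    outcomes: dict = field(default_factory=dict)  # attempt -> improved?

    def record(self, attempt: str, improved: bool) -> None:
        if attempt not in self.planned:
            raise KeyError(f"{attempt!r} was not in the plan")
        self.outcomes[attempt] = improved

    def remaining(self) -> list:
        return [a for a in self.planned if a not in self.outcomes]

    def may_abandon(self) -> bool:
        # Only a fully explored plan with no successes justifies
        # dropping the whole strategy.
        return not self.remaining() and not any(self.outcomes.values())


plan = StrategyPlan(
    question="Does uniform L2 regularization close the train-test gap?",
    planned=["reg_lambda=0.5", "reg_lambda=1.0", "reg_lambda=2.0"],
)
plan.record("reg_lambda=0.5", improved=False)
print(plan.may_abandon())  # one failed attempt is not enough
print(plan.remaining())
```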
Sometimes the right move is not to run an experiment:
domain-research or diagnosis to generate a real hypothesis.