From harnessml
Use after every experiment run. This is the core data science skill — reading results, understanding errors, and forming the next hypothesis. If you skip this, you're not doing science, you're doing random search.
Slash command: /harnessml:diagnosis
A metric is a symptom. Diagnosis finds the cause. You don't treat symptoms — you understand the disease.
When you see "Brier improved by 0.004," that is not a conclusion. It's the beginning of a question: where did the improvement come from? Which predictions changed? Did calibration improve, or did discrimination improve? Is the gain robust across folds, or driven by one?
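One way to separate "did calibration improve?" from "did discrimination improve?" is the Murphy decomposition of the Brier score into reliability (calibration error, lower is better), resolution (discrimination, higher is better), and uncertainty (fixed by the base rate). The sketch below is a minimal illustration in plain Python, not part of harnessml. The function name and the equal-width binning are assumptions, and the identity is exact only when all forecasts in a bin share the same value.

```python
from collections import defaultdict


def brier_decomposition(probs, labels, n_bins=10):
    """Murphy decomposition: brier ~= reliability - resolution + uncertainty.

    Hypothetical helper for illustration; assumes binary labels (0/1),
    probabilities in [0, 1], and non-empty inputs.
    """
    n = len(probs)
    base_rate = sum(labels) / n

    # Group predictions into equal-width probability bins.
    bins = defaultdict(list)
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    reliability = 0.0  # gap between forecast and observed frequency
    resolution = 0.0   # how far bin frequencies sit from the base rate
    for members in bins.values():
        k = len(members)
        mean_p = sum(p for p, _ in members) / k
        freq = sum(y for _, y in members) / k
        reliability += k * (mean_p - freq) ** 2
        resolution += k * (freq - base_rate) ** 2

    brier = sum((p - y) ** 2 for p, y in zip(probs, labels)) / n
    return {
        "brier": brier,
        "reliability": reliability / n,
        "resolution": resolution / n,
        "uncertainty": base_rate * (1 - base_rate),
    }
```

Running this on the predictions from two experiments and comparing the components shows whether a Brier gain came from lower reliability (better calibration) or from higher resolution (better discrimination).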
After every run:
pipeline(action="diagnostics")
pipeline(action="compare_latest")
Start here, but don't stop here.
The per-fold breakdown is where the real signal is.
What fold patterns tell you:
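As a rough sketch of reading fold patterns, the helper below compares per-fold scores from two runs and reports how many folds improved, which fold is worst, and how much of the total gain comes from a single fold. It assumes you have already extracted per-fold scores for a lower-is-better metric such as Brier. The function name and return keys are hypothetical, not harnessml output.

```python
def summarize_fold_deltas(baseline, candidate):
    """Compare per-fold scores for a lower-is-better metric (e.g. Brier).

    Hypothetical helper; assumes both lists are non-empty, the same
    length, and ordered by fold.
    """
    deltas = [c - b for b, c in zip(baseline, candidate)]
    improved = [i for i, d in enumerate(deltas) if d < 0]

    # Total improvement across folds that got better, and the largest
    # single-fold improvement, to see whether one fold drives the gain.
    total_gain = -sum(d for d in deltas if d < 0)
    largest_gain = max(-d for d in deltas)
    share = largest_gain / total_gain if total_gain > 0 else 0.0

    return {
        "deltas": deltas,
        "n_improved": len(improved),
        "worst_fold": max(range(len(candidate)), key=lambda i: candidate[i]),
        "top_fold_share": share,
    }
```

A `top_fold_share` close to 1.0 suggests the aggregate gain is driven by one fold and may not be robust. A small share with most folds improved points toward a structural improvement.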
pipeline(action="diagnostics") # includes calibration curves
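To make the calibration check concrete, here is a minimal sketch of expected calibration error (ECE) with equal-width bins. It illustrates the general idea only. The binning scheme, the bin count, and the function name are assumptions, and the ECE that harnessml reports may be computed differently.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: weighted average gap between mean confidence and observed rate.

    Hypothetical helper; assumes binary labels (0/1), probabilities in
    [0, 1], and non-empty inputs.
    """
    n = len(probs)
    # Per bin: [sum of predicted probabilities, sum of labels, count].
    stats = [[0.0, 0.0, 0] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        stats[idx][0] += p
        stats[idx][1] += y
        stats[idx][2] += 1

    ece = 0.0
    for sum_p, sum_y, count in stats:
        if count:
            # Weight each bin's calibration gap by its share of samples.
            ece += (count / n) * abs(sum_p / count - sum_y / count)
    return ece
```

Tracking this alongside Brier makes it easier to notice the case where discrimination improves while calibration degrades.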
This is the purpose of diagnosis. Every diagnostic finding should generate a question:
| Finding | Question it generates |
|---|---|
| Fold 3 always worst | What's different about fold 3's data? Is it a time period, a subgroup, a distribution shift? |
| Train-test gap on tree models only | Are trees overfitting to interactions that don't generalize? Would regularization or max_depth reduction help? |
| Feature X has high importance but low univariate correlation | The model found a non-linear or conditional relationship. What interaction or transformation makes this explicit? |
| Two models have 0.98 correlation | They're learning the same thing. Can I differentiate them with different feature sets? Or should I drop one and add a different model family? |
| ECE degraded while Brier improved | Discrimination improved but calibration suffered. Is the calibration method appropriate? Would a different calibrator help? |
| New feature has zero importance | Is it redundant with an existing feature? Or does the model need a different functional form (e.g., binned instead of continuous)? |
| All folds improved slightly | Genuine structural improvement, but small. Is there more headroom in this direction, or is this the ceiling? |
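The "two models have 0.98 correlation" finding can be checked directly from out-of-fold predictions. The sketch below flags model pairs whose predictions are highly correlated. The model names, the dictionary layout, and the 0.98 default threshold are illustrative assumptions, not harnessml conventions.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; assumes equal length and non-constant inputs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def redundant_pairs(predictions, threshold=0.98):
    """Return (model_a, model_b, r) for pairs with correlation >= threshold.

    `predictions` maps a model name to its predictions on the same rows
    in the same order (hypothetical layout).
    """
    flagged = []
    for a, b in combinations(sorted(predictions), 2):
        r = pearson(predictions[a], predictions[b])
        if r >= threshold:
            flagged.append((a, b, r))
    return flagged
```

A flagged pair is a prompt for the question in the table, not a verdict: the two models may still differ on the specific rows where errors concentrate.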
Write down the next hypothesis before closing the diagnosis. If you finish diagnosing and don't know what to try next, you didn't dig deep enough.
When available, look at where the model's errors concentrate:
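One simple way to see where errors concentrate is to average the squared error within each segment of the data and sort the segments from worst to best. This is a minimal sketch under the assumption that per-row predictions, labels, and a segment label (a time period, a subgroup, or similar) are available. The function name and the choice of squared error are illustrative.

```python
from collections import defaultdict


def error_by_segment(probs, labels, segments):
    """Mean squared error per segment, worst segment first.

    Hypothetical helper; `segments` holds one hashable label per row.
    Returns a list of (segment, mean_squared_error, row_count).
    """
    totals = defaultdict(lambda: [0.0, 0])  # segment -> [sum sq. error, count]
    for p, y, seg in zip(probs, labels, segments):
        totals[seg][0] += (p - y) ** 2
        totals[seg][1] += 1

    rows = [(seg, sq / n, n) for seg, (sq, n) in totals.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Keeping the row count next to each segment's error helps avoid over-reading a segment that contains only a handful of examples.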
After every experiment, write:
## Diagnosis: [Experiment ID]
### What happened
[Metric changes — aggregate and per-fold]
### Why it happened
[Mechanism — not just "the metric went down" but WHY]
### What was confirmed
[Parts of the hypothesis that were supported]
### What was surprising
[Things you didn't predict — these are the most valuable]
### Next hypothesis
[What to investigate based on these findings]
Write key diagnostic findings to the notebook so they persist across sessions:
notebook(action="write", type="finding", content="[diagnosis insight]", experiment_id="...")
Not every diagnostic detail — just the insights that should inform future work.
npx claudepluginhub msilverblatt/harness-ml --plugin harnessml