From autoresearch
Computes MAD-based confidence scores for experiment results to determine if improvements exceed noise. Use after 3+ positive metric data points.
Slash command
/autoresearch:confidence-scoring
Determines whether an observed improvement is real or within measurement noise using Median Absolute Deviation (MAD).
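With fewer than 3 positive metric data points, return `confidence: null`.

Given all metric values in the current segment (positive values only):

1. Compute the median of the values.
2. Compute each value's absolute deviation from that median, `|value - median|`. Take the sorted median of those absolute deviations. This is the MAD.
3. Find the best `keep`-status metric value (respecting optimization direction).
4. Compute the delta: `|best_kept - baseline|`
5. Compute the confidence: `delta / MAD`

Edge cases:

- MAD is 0: return `null`, since there is no measurable noise to compare against.
- No `keep` results exist yet: return `null`.
- No improvement over the baseline: return `null`, since there is no improvement to score.

The confidence score is a multiple of the session's noise floor: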
| Score | Meaning | Action |
|---|---|---|
| ≥ 2.0× | Improvement likely real | Safe to trust |
| 1.0×–2.0× | Marginal — could be noise | Consider re-running to confirm |
| < 1.0× | Within noise floor | Treat as no improvement |
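
The steps and thresholds above can be sketched in Python roughly as follows. This is a minimal illustration, not the plugin's actual implementation: the function names `confidence_score` and `interpret`, the `lower_is_better` flag, and the choice to pass the baseline in explicitly are all assumptions made for the example.

```python
from statistics import median
from typing import Optional, Sequence


def confidence_score(
    values: Sequence[float],
    kept_values: Sequence[float],
    baseline: float,
    lower_is_better: bool = True,  # hypothetical flag for optimization direction
) -> Optional[float]:
    """Return delta / MAD, or None in the cases where the skill returns null."""
    positives = [v for v in values if v > 0]  # positive values only
    if len(positives) < 3 or not kept_values:
        return None  # too few data points, or no keep results yet

    center = median(positives)
    mad = median(abs(v - center) for v in positives)
    if mad == 0:
        return None  # no measurable noise to compare against

    best_kept = min(kept_values) if lower_is_better else max(kept_values)
    delta = abs(best_kept - baseline)
    if delta == 0:
        return None  # no improvement to score
    return delta / mad


def interpret(score: Optional[float]) -> str:
    """Map a score onto the three bands of the table above."""
    if score is None:
        return "not scored"
    if score >= 2.0:
        return "likely real"
    if score >= 1.0:
        return "marginal"
    return "within noise"
```

Returning `None` rather than `0.0` keeps "could not be scored" distinct from "scored and found to be within noise", which matches the `null` convention described above.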
When logging an experiment result to autoresearch.jsonl:

- If the score cannot be computed, write `"confidence": null` in the JSONL record.
- When deciding `keep` vs `discard`: the confidence score is advisory. It never auto-discards. But flag improvements below 1.0× in your ASI notes as "within noise — may not be real."

Runs: [15200, 15400, 14800, 15100, 14600]
Median: 15100
Deviations: [100, 300, 300, 0, 500] → sorted: [0, 100, 300, 300, 500]
MAD: 300
Baseline: 15200 (first run)
Best kept: 14600
Delta: |14600 - 15200| = 600
Confidence: 600 / 300 = 2.0× ← improvement is likely real
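
As a rough check, the worked example can be reproduced and turned into a JSONL line like this. Only the `"confidence"` key comes from the skill text; the `"metric"` and `"status"` field names and the `to_jsonl_line` helper are hypothetical, since the actual record schema is not shown here.

```python
import json
from statistics import median
from typing import Optional

runs = [15200, 15400, 14800, 15100, 14600]
baseline, best_kept = runs[0], 14600  # values from the worked example

center = median(runs)                         # 15100
mad = median(abs(v - center) for v in runs)   # 300
confidence = abs(best_kept - baseline) / mad if mad else None


def to_jsonl_line(metric: float, status: str, confidence: Optional[float]) -> str:
    """Serialize one record; a confidence of None becomes JSON null."""
    # "metric" and "status" are assumed field names, used for illustration only.
    return json.dumps({"metric": metric, "status": status, "confidence": confidence})


line = to_jsonl_line(best_kept, "keep", confidence)
print(line)  # {"metric": 14600, "status": "keep", "confidence": 2.0}
```

Appending `line` plus a newline to autoresearch.jsonl would then record the result, and passing `None` as the confidence yields `"confidence": null` in the output.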
npx claudepluginhub pbdeuchler/llm-plugins --plugin autoresearch