Skill

result-analyze

Paired statistical analysis of multi-seed experiment outputs from pipeline-scaffold. Use when the user has results.json files across variants and wants confidence intervals, paired t / Wilcoxon, and a primary-metric callout.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/research-helper:result-analyze

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Reads `outputs/seed-*/results.json` files across one or more variants and produces a paired statistical analysis markdown + JSON sidecar at `research/analysis/<slug>-<date>.md`.

Supporting Files

scripts/analyze.py, scripts/requirements.txt, scripts/test_analyze.py

SKILL.md

80 lines · ~1k tokens

Stats

Language: Python

Parent stars: 0

Maintenance: Good

Last Commit: Jun 1, 2026

result-analyze

Reads `outputs/seed-*/results.json` files across one or more variants and produces a paired statistical analysis markdown + JSON sidecar at `research/analysis/<slug>-<date>.md`.

When to use

Triggers: "analyze my results", "compare baseline vs treatment", "run statistical tests on these experiments", "is the difference significant?", "what do these results tell me?"

Anti-triggers:

- "Plot my results": out of scope for v0.5 (text-only report)
- "Run my experiment": that's pipeline-scaffold
- "Decide whether my hypothesis is true": the report gives the stats; interpretation is the user's

Process

Identify the variants. From the user's description, figure out what's being compared. Common patterns:
- Baseline vs treatment: 2 variants
- Hyperparameter sweep: N variants (one per setting)
- Ablation: N+1 variants (full + each ablation)
- Single-variant analysis: just per-variant summary (no pairwise)
Look for an experiment-design spec. Check `research/experiments/<slug>-<date>.md`. If one matches, pass it as `--from-spec` so the primary metric is inherited.
Build the glob patterns. The default pipeline-scaffold layout is `outputs/seed-<S>/results.json` inside the experiment directory. Cross-experiment comparisons require explicit paths:
```
--variant baseline:'research/code/<slug>/outputs/seed-*/results.json' \
--variant treatment:'research/code/<slug>-v2/outputs/seed-*/results.json'
```
Ordering matters for the sign of Δ. Pairs are computed as (first variant) − (second variant), so list the baseline second when you want a positive Δ to mean "the alternative improved over baseline". For example, `--variant treatment:... --variant baseline:...` yields Δ = treatment − baseline, whereas the baseline-first order used in the examples here yields Δ = baseline − treatment.
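The pairing and sign convention can be sketched as follows. This is an illustration only, not the code in `scripts/analyze.py`: the helper names `parse_variant`, `seed_of`, and `paired_deltas` are hypothetical, and the sketch assumes seeds are integers and that each variant has already been reduced to one metric value per seed.

```python
import re
from pathlib import PurePosixPath


def parse_variant(spec: str) -> tuple[str, str]:
    """Split a 'NAME:GLOB' spec at the first colon (hypothetical helper)."""
    name, _, pattern = spec.partition(":")
    if not name or not pattern:
        raise ValueError(f"expected NAME:GLOB, got {spec!r}")
    return name, pattern


def seed_of(path: str) -> int:
    """Extract S from a path component named seed-<S> (assumes integer seeds)."""
    for part in PurePosixPath(path).parts:
        match = re.fullmatch(r"seed-(\d+)", part)
        if match:
            return int(match.group(1))
    raise ValueError(f"no seed-<S> directory in {path!r}")


def paired_deltas(first: dict[int, float], second: dict[int, float]) -> dict[int, float]:
    """Return first - second for every seed present in both variants."""
    shared = sorted(first.keys() & second.keys())
    return {seed: first[seed] - second[seed] for seed in shared}


# Made-up values: treatment is listed first, so a positive delta
# means treatment scored higher than baseline on that seed.
treatment = {0: 0.81, 1: 0.79, 2: 0.83}
baseline = {0: 0.78, 1: 0.80, 3: 0.77}
deltas = paired_deltas(treatment, baseline)  # only seeds 0 and 1 are shared
```

Seeds 2 and 3 drop out because each appears in only one variant. A mismatch like this reduces the effective number of pairs.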

Call the script.

```
python skills/result-analyze/scripts/analyze.py \
  --variant baseline:'research/code/<slug>/outputs/seed-*/results.json' \
  --variant treatment:'research/code/<slug>-v2/outputs/seed-*/results.json' \
  --from-spec research/experiments/<slug>-<date>.md
```

Options:

- `--ci-level 0.95` (default)
- `--bootstrap-iterations 10000` (default)
- `--correction holm` (default; or `bonferroni`, `none`)
- `--primary-metric NAME` (overrides `--from-spec`)
- `--output PATH` (overrides default location)
- `--force` (overwrite existing)
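To make `--ci-level` and `--bootstrap-iterations` concrete, here is a minimal sketch of a percentile bootstrap interval for the mean paired difference. Whether `analyze.py` uses the percentile method is an assumption; the function name `bootstrap_ci` and the `rng_seed` parameter are hypothetical and exist only for this illustration.

```python
import random
import statistics


def bootstrap_ci(deltas, ci_level=0.95, iterations=10_000, rng_seed=0):
    """Percentile bootstrap CI for the mean of paired differences.

    Assumption: a plain percentile interval; the real script may use
    a different bootstrap variant.
    """
    if len(deltas) < 2:
        raise ValueError("need at least two paired differences")
    rng = random.Random(rng_seed)  # fixed seed so the sketch is reproducible
    n = len(deltas)
    # Resample the differences with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(deltas, k=n)) for _ in range(iterations)
    )
    alpha = 1.0 - ci_level
    lo_idx = int((alpha / 2) * (iterations - 1))
    hi_idx = int((1 - alpha / 2) * (iterations - 1))
    return means[lo_idx], means[hi_idx]


# Made-up per-seed differences, for illustration only.
example_deltas = [0.03, 0.01, 0.02, 0.04, 0.02]
low, high = bootstrap_ci(example_deltas, iterations=2000)
```

With only a handful of seeds the resampled means take few distinct values, so the interval is coarse. This is one reason small-n results deserve cautious wording.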

Walk the user through the report. Surface:
- Primary-metric callout: mean difference, CI, p-value (raw + corrected), significance flag
- Whether correction changed any conclusions vs raw p
- Any warnings (dropped seeds, missing metrics, n=1 cases)
- The path to the markdown + JSON sidecar
Don't over-interpret. The report describes what the numbers show. Avoid claiming a result is "definitive" off a small n.
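The raw versus corrected p-values surfaced in the report can be illustrated with a short sketch of the Holm step-down adjustment, the default for `--correction`. The function names `holm_adjust` and `flipped_by_correction` are hypothetical, and the p-values are made up. This shows the standard procedure, not necessarily the exact code in `analyze.py`.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is multiplied by m - k + 1.
        candidate = min(1.0, (m - rank) * p_values[i])
        # Enforce monotonicity: an adjusted p never drops below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


def flipped_by_correction(p_values, alpha=0.05):
    """Indices whose significance call differs between raw and Holm-adjusted p."""
    adjusted = holm_adjust(p_values)
    return [
        i
        for i, (raw, adj) in enumerate(zip(p_values, adjusted))
        if (raw < alpha) != (adj < alpha)
    ]


raw_p = [0.01, 0.04, 0.03]           # made-up raw p-values for three comparisons
changed = flipped_by_correction(raw_p)  # comparisons whose conclusion changes
```

Any index returned by `flipped_by_correction` marks a comparison that looks significant on raw p but not after correction. That is the case worth calling out to the user.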

Sanity checks to mention before drawing conclusions

- Were the same seeds used in each variant? If not, the intersection warning is informative: fewer shared seeds means less statistical power.
- Was `num_examples` consistent across runs? A variant with fewer eval examples is not directly comparable on raw accuracy.
- Is the primary metric the one the spec pre-registered? If result-analyze defaulted to alphabetical-first (warning emitted), call this out.
- If n_seeds < 5, flag that the analysis has very limited statistical power. The Wilcoxon test in particular needs ≥ 6 pairs for a two-sided p to fall below 0.05, because the smallest attainable exact two-sided p-value with n untied pairs is 2/2^n; a one-sided p needs ≥ 5 pairs.
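The Wilcoxon floor can be checked by brute force. The sketch below enumerates every sign pattern for n untied, nonzero paired differences and reports the smallest exact p-value the signed-rank test can produce. The function name `min_wilcoxon_p` is hypothetical and the enumeration is only practical for small n.

```python
from itertools import product


def min_wilcoxon_p(n_pairs):
    """Smallest exact one- and two-sided signed-rank p-values for n untied pairs."""
    ranks = range(1, n_pairs + 1)
    max_w = sum(ranks)  # rank sum when every difference has the same sign
    total = 2 ** n_pairs  # sign patterns are equally likely under the null
    extreme = sum(
        1
        for signs in product((0, 1), repeat=n_pairs)
        if sum(r for r, s in zip(ranks, signs) if s) == max_w
    )
    one_sided = extreme / total
    return one_sided, min(1.0, 2 * one_sided)


for n in (4, 5, 6):
    one_sided, two_sided = min_wilcoxon_p(n)
    print(n, one_sided, two_sided)
# 4 0.0625 0.125
# 5 0.03125 0.0625
# 6 0.015625 0.03125
```

Even when every seed favors the same variant, four pairs cannot reach p < 0.05 in either direction, and five pairs can only do so one-sided.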

Output location

`research/analysis/<slug>-<date>.md` plus a sidecar `<slug>-<date>.json`. The JSON sidecar is the machine-readable version (same data, no prose); point downstream tools at it.
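For downstream tooling, the sidecar path can be derived from the report path by swapping the extension. A minimal sketch follows. The helper name `sidecar_path` and the example filename are hypothetical, and the sidecar's internal schema is not documented here, so the sketch stops at locating the file.

```python
from pathlib import Path


def sidecar_path(report_md):
    """Map a report path ending in .md to its JSON sidecar path."""
    report = Path(report_md)
    if report.suffix != ".md":
        raise ValueError(f"expected a .md report path, got {report_md!r}")
    return report.with_suffix(".json")


# Hypothetical filename; the real slug and date format come from the run.
example = sidecar_path("research/analysis/demo-2026-06-01.md")
```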

What this skill DOES NOT do

- Plotting (deferred to v0.5.x)
- Bayesian analysis
- Auto-running of experiments
- Cross-condition interpretation (it reports; the user interprets)
