Evaluate whether a variant backtest sweep has converged — read the results, measure top-K dispersion, IS/OOS rank stability, and parameter-plateau structure, then decide ship vs iterate vs kill. For quant researchers running parameter sweeps. Trigger with "check convergence", "are these variants converged?", "should I keep iterating?", "evaluate this sweep", "iteration check", "convergence report".
Slash command: /strategy-evaluation:convergence
The first capability of the **strategy-evaluation** plugin. Drop in a table of variant backtest results — get a convergence verdict and a recommendation: ship the winner, iterate further, or kill the line of research.
This is the public, sanitized version of an internal pattern used to run multi-variant parameter sweeps in live crypto trading research at Martian Mobile. The convergence math is computed deterministically by a bundled Python analyzer; the qualitative call on which variants to try next is left to the agent.
Sibling evaluators (robustness, walk-forward, regime breakdown) can live alongside this one as additional skills under the same plugin.
┌─────────────────────────────────────────────────────────────────┐
│ STRATEGY EVALUATION · CONVERGENCE │
├─────────────────────────────────────────────────────────────────┤
│ INPUT │
│ ✓ Variant results: CSV (Parquet/SQLite → export to CSV first) │
│ ✓ Each row = one variant (params + metrics) │
│ ✓ You specify the convergence metric (Sharpe, hit-rate, etc.) │
├─────────────────────────────────────────────────────────────────┤
│ ANALYSIS (scripts/analyze.py, stdlib only) │
│ ✓ Dispersion of top-K variants on chosen metric │
│ ✓ Stability across IS / OOS windows (Spearman, if present) │
│ ✓ Parameter-space neighborhood check (winners cluster?) │
├─────────────────────────────────────────────────────────────────┤
│ OUTPUT │
│ ✓ Verdict: CONVERGED / ITERATE / KILL │
│ ✓ Reasoning with the numbers behind it │
│ ✓ If ITERATE: suggested next variants to try │
└─────────────────────────────────────────────────────────────────┘
Point me at your variant results. Tell me the metric you care about. I'll do the rest.
You: /convergence results.csv --metric sharpe_oos
Me: [Runs the analyzer, interprets the verdict, returns the report]
Required columns in your input file:
- A variant identifier column (variant_id, run_id, params, etc.)

Optional but useful:
- An IS/OOS metric pair (sharpe_is + sharpe_oos) for rank-stability
- A sample-count column (n_trades, etc.), so variants with too few trades get flagged

A CSV with one row per variant works fine:
variant_id, lookback, threshold, sharpe_is, sharpe_oos, n_trades
v001, 20, 0.5, 1.82, 1.61, 412
v002, 20, 0.6, 1.79, 1.55, 388
v003, 25, 0.5, 1.85, 1.58, 401
...
The analyzer infers what's a metric and what's a parameter from column names and types. IS/OOS pairs are detected by suffix (_is/_oos, _in/_out, _train/_test). Override any inference with --metric, --id-column, --sample-column.
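As a rough illustration of the suffix-based pairing described above, the sketch below matches in-sample and out-of-sample columns by name. The helper `find_is_oos_pairs` and the `SUFFIX_PAIRS` table are hypothetical names for this example; the bundled analyzer's actual logic may differ.

```python
# Hypothetical helper, not the analyzer's real code.
# Suffix pairs are the ones listed in the text above.
SUFFIX_PAIRS = [("_is", "_oos"), ("_in", "_out"), ("_train", "_test")]


def find_is_oos_pairs(columns):
    """Return {base_name: (is_column, oos_column)} for columns that pair up by suffix."""
    names = set(columns)
    pairs = {}
    for col in columns:
        for is_suffix, oos_suffix in SUFFIX_PAIRS:
            if col.endswith(is_suffix):
                base = col[: -len(is_suffix)]
                partner = base + oos_suffix
                # Only report a pair when the matching OOS column actually exists.
                if base and partner in names:
                    pairs[base] = (col, partner)
    return pairs


columns = ["variant_id", "lookback", "threshold", "sharpe_is", "sharpe_oos", "n_trades"]
print(find_is_oos_pairs(columns))  # {'sharpe': ('sharpe_is', 'sharpe_oos')}
```

A column with an in-sample suffix but no out-of-sample partner is simply ignored here, which is one reason to pass --metric explicitly when column names are unusual.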
Runnable samples ship beside this skill:
- ${CLAUDE_PLUGIN_ROOT}/skills/convergence/examples/results_converged.csv → CONVERGED
- ${CLAUDE_PLUGIN_ROOT}/skills/convergence/examples/results_iterate.csv → ITERATE

When invoked, do this:
1. Resolve the file path the user gave (a CSV, or a Parquet/SQLite file they should export to CSV first). If they didn't name a metric, the analyzer can auto-detect one, but confirm the ranking metric with the user when it is ambiguous.
2. Run the bundled script. It is pure Python 3 stdlib, so no install is needed:

   python3 "${CLAUDE_PLUGIN_ROOT}/skills/convergence/scripts/analyze.py" <input.csv> --metric <metric> [--top-k 5] [--save]

   Useful flags (defaults match the config below):
   - --metric: ranking metric column (auto-detected if omitted)
   - --top-k: number of top variants to assess (default 5)
   - --dispersion-threshold: max relative dispersion for CONVERGED (default 0.05)
   - --min-samples: min samples per variant before a variant is trusted (default 200)
   - --rank-threshold: min IS/OOS Spearman for CONVERGED (default 0.6)
   - --lower-is-better / --higher-is-better: direction override
   - --save: also writes iteration_check_<timestamp>.md

   The script prints the full Markdown report and sets an exit code: 0 CONVERGED, 1 ITERATE, 2 KILL.
3. Present the analyzer's report. Then add value the deterministic script cannot, such as the qualitative call on which variants to try next. Never override the analyzer's numbers; interpret them.
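For callers that want to drive the analyzer from another script, the documented exit codes can be turned back into verdicts. This is a minimal sketch: `run_analyzer` and `verdict_from_exit_code` are hypothetical wrapper names, and treating any other exit code as "ERROR" is an assumption of this example, not documented behavior.

```python
import subprocess
import sys

# Exit codes as documented above: 0 CONVERGED, 1 ITERATE, 2 KILL.
VERDICTS = {0: "CONVERGED", 1: "ITERATE", 2: "KILL"}


def verdict_from_exit_code(code):
    """Map an exit code to a verdict; unknown codes become "ERROR" (assumption)."""
    return VERDICTS.get(code, "ERROR")


def run_analyzer(script_path, csv_path, metric, top_k=5):
    """Run analyze.py and return (verdict, report_markdown, stderr_text)."""
    cmd = [sys.executable, script_path, csv_path,
           "--metric", metric, "--top-k", str(top_k)]
    # check=False: non-zero exit codes are verdicts here, not failures.
    proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return verdict_from_exit_code(proc.returncode), proc.stdout, proc.stderr
```

One caveat: a Python process that dies on an unhandled exception also exits with status 1, so it is prudent to check that stderr is empty and that the report was actually printed before trusting an ITERATE verdict.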
# Iteration Check | [Run name] | [Date]
## Verdict: CONVERGED | ITERATE | KILL
**[One-line summary of why]**
---
## Convergence Analysis
| Metric | Top-K Mean | Top-K Std | Dispersion | Threshold |
|--------|-----------|-----------|------------|-----------|
| sharpe_oos (rank) | 1.58 | 0.03 | 1.7% | < 5% ✓ |
| sharpe_is | 1.82 | 0.02 | 1.3% | < 5% ✓ |
**IS→OOS rank stability (Spearman):** 0.99 (threshold > 0.60 ✓)
**Read:** Top-K are clustered tightly on both IS and OOS — real convergence, not one lucky variant.
---
## Parameter-Space Check
- `lookback`: top-K at [20, 25, 30] — contiguous
- `threshold`: top-K at [0.5, 0.6] — contiguous
[ ] Single isolated peak — fragile, retest with more samples
[x] Plateau region — robust, safe to deploy
---
## Recommendation
[CONVERGED] Top variant `v10` is representative — neighbors perform similarly. Safe to advance.
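The two headline numbers in the report, relative dispersion and IS→OOS Spearman, can be reproduced with the standard library alone. The sketch below assumes dispersion means sample standard deviation divided by the absolute mean, and that ties receive average ranks; the bundled analyzer may define either differently. The function names are illustrative.

```python
import statistics


def relative_dispersion(values):
    """Sample standard deviation divided by |mean| (assumed definition)."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("relative dispersion is undefined for a zero mean")
    return statistics.stdev(values) / abs(mean)


def average_ranks(values):
    """Rank values 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = statistics.mean(rx), statistics.mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one side is constant")
    return cov / (var_x * var_y) ** 0.5


# The three example rows from the CSV snippet earlier in this document.
sharpe_is = [1.82, 1.79, 1.85]
sharpe_oos = [1.61, 1.55, 1.58]
print(spearman(sharpe_is, sharpe_oos))  # 0.5
print(relative_dispersion(sharpe_oos) < 0.05)  # True
```

Three rows are far too few to judge rank stability; the example only shows the mechanics.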
CONVERGED (all must hold):
- Top-K dispersion < threshold (default 5%)
- IS/OOS rank correlation > threshold (default 0.6, if OOS available)
- Winners cluster in parameter space (plateau, not isolated points)
- Top variant NOT at the edge of the swept range
- Minimum sample size per variant met (if a sample column exists)
ITERATE if:
- Some criteria met but not all
- Top variant near the edge of the swept parameter range
- IS/OOS gap or weak rank stability suggests undersampling
KILL if:
- IS ranking does not survive OOS (Spearman ≤ 0)
- Top variants are scattered with high dispersion — no stable region
- Best variant is statistically indistinguishable from the median
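The rules above can be read as a small decision function. The sketch below is one possible encoding, not the analyzer's implementation: `SweepStats`, `decide`, `is_contiguous`, and `at_edge` are hypothetical names, checking KILL before CONVERGED is an assumed precedence, "scattered with high dispersion" is interpreted as dispersion at or above the threshold with no clustering, and the "indistinguishable from the median" test is omitted because the source does not specify it.

```python
from dataclasses import dataclass
from typing import Optional


def is_contiguous(top_values, swept_values):
    """True if the top-K values occupy consecutive positions on the swept grid."""
    grid = sorted(set(swept_values))
    idx = sorted({grid.index(v) for v in top_values})
    return idx[-1] - idx[0] == len(idx) - 1


def at_edge(best_value, swept_values):
    """True if the best variant sits on the boundary of the swept range."""
    return best_value in (min(swept_values), max(swept_values))


@dataclass
class SweepStats:
    dispersion: float                # relative top-K dispersion, e.g. 0.017
    spearman: Optional[float]        # IS/OOS rank correlation; None if no OOS column
    winners_cluster: bool            # top-K form a plateau in parameter space
    top_at_edge: bool                # best variant is on the edge of the swept range
    min_samples_met: Optional[bool]  # None if there is no sample column


def decide(stats, dispersion_threshold=0.05, rank_threshold=0.6):
    """Return "CONVERGED", "ITERATE", or "KILL" under the assumptions stated above."""
    # KILL: the IS ranking does not survive OOS.
    if stats.spearman is not None and stats.spearman <= 0:
        return "KILL"
    # KILL: scattered winners with high dispersion, so no stable region.
    if stats.dispersion >= dispersion_threshold and not stats.winners_cluster:
        return "KILL"
    converged = (
        stats.dispersion < dispersion_threshold
        and (stats.spearman is None or stats.spearman > rank_threshold)
        and stats.winners_cluster
        and not stats.top_at_edge
        and stats.min_samples_met is not False
    )
    # Anything that is neither killed nor fully converged needs more iteration.
    return "CONVERGED" if converged else "ITERATE"


lookbacks = [10, 15, 20, 25, 30, 35, 40]  # illustrative swept grid, not from the source
stats = SweepStats(
    dispersion=0.017,
    spearman=0.99,
    winners_cluster=is_contiguous([20, 25, 30], lookbacks),
    top_at_edge=at_edge(25, lookbacks),
    min_samples_met=True,
)
print(decide(stats))  # CONVERGED
```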
Defaults work for most cases. Override via CLI flags:
metric: sharpe_oos
top_k: 5
dispersion_threshold: 0.05 # 5% relative
min_samples_per_variant: 200
rank_stability_threshold: 0.6
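To make the relationship between these defaults and the CLI flags concrete, here is a hedged argparse sketch. The flag names and default values come from this document; the parser structure, the `build_parser` name, and the `higher_is_better` destination are assumptions and may not match scripts/analyze.py.

```python
import argparse


def build_parser():
    """Illustrative parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Convergence check (illustrative sketch)")
    parser.add_argument("input_csv", help="CSV with one row per variant")
    parser.add_argument("--metric", default=None,
                        help="ranking metric column; auto-detected if omitted")
    parser.add_argument("--top-k", type=int, default=5)
    parser.add_argument("--dispersion-threshold", type=float, default=0.05)
    parser.add_argument("--min-samples", type=int, default=200)
    parser.add_argument("--rank-threshold", type=float, default=0.6)
    # The two direction flags share one destination; None means "infer from the metric".
    direction = parser.add_mutually_exclusive_group()
    direction.add_argument("--lower-is-better", dest="higher_is_better",
                           action="store_false", default=None)
    direction.add_argument("--higher-is-better", dest="higher_is_better",
                           action="store_true", default=None)
    parser.add_argument("--save", action="store_true",
                        help="also write iteration_check_<timestamp>.md")
    return parser


args = build_parser().parse_args(["results.csv", "--metric", "sharpe_oos"])
print(args.top_k, args.dispersion_threshold, args.min_samples, args.rank_threshold)
# 5 0.05 200 0.6
```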
In real quant research, the failure mode isn't running too few backtests — it's calling one lucky variant a "winner" when its neighbors have completely different performance. Convergence-across-variants is the only honest signal that the parameter region has structure. This capability enforces the discipline.
Built by Martian Mobile. MIT-licensed, public sanitized version of an internal workflow.
npx claudepluginhub martianmobile/strategy-evaluation --plugin strategy-evaluation