From curry-train
Diagnose a training failure or stall by inspecting recent logs, loss curves, OOM traces, NaN events, and config. Activate when the user asks "why did my training crash", "loss went to NaN", "OOM during step X", "training is not improving", or "help me debug this run". Delegates to the failure-diagnoser agent.
Slash command
/curry-train:diagnose
Read the artifacts of a recent run, classify the failure mode, and produce a ranked list of likely causes with concrete fixes. Most of the analytical work is delegated to the `failure-diagnoser` agent.
$1 (optional): path to a run directory or a single log file. If omitted, look for the most recently modified directory under runs/ or outputs/ in the current project.
Resolve the target. If the user gave a path, use it. Otherwise, find the most recent runs/<timestamp>/ or outputs/<timestamp>/. If nothing exists, ask the user where their logs are.
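The target-resolution rule above can be sketched in a few lines of Python. This is a minimal illustration, not the plugin's actual implementation: the function name `resolve_target` and the `project_root` parameter are hypothetical, and "most recent" is assumed to mean the latest modification time.

```python
from pathlib import Path
from typing import Optional


def resolve_target(arg: Optional[str] = None, project_root: str = ".") -> Optional[Path]:
    """Return the path to diagnose, or None when no run directory can be found.

    Hypothetical helper: mirrors the rule "use $1 if given, else the most
    recently modified directory under runs/ or outputs/".
    """
    if arg:
        return Path(arg)

    root = Path(project_root)
    candidates = [
        child
        for parent in ("runs", "outputs")
        if (root / parent).is_dir()
        for child in (root / parent).iterdir()
        if child.is_dir()
    ]
    if not candidates:
        return None  # caller should ask the user where their logs are

    # Assumption: "most recent" means the latest modification time.
    return max(candidates, key=lambda d: d.stat().st_mtime)
```

Returning `None` rather than raising keeps the "ask the user" fallback in the caller's hands.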
Collect the evidence bundle. Read at most:
- *.log or stderr.txt in the directory.
- config.yaml (the resolved Hydra config).
- metrics.jsonl (last 200 entries).
- git_sha.txt, seed.txt, cuda_info.txt if present.
- core.* traceback or Python traceback files.
Spawn the failure-diagnoser agent. Hand it the evidence bundle and ask for a structured diagnosis (see agents/failure-diagnoser.md). Output schema:
classification: <NaN | OOM | divergence | stall | crash | unclear>
primary cause: <one line>
contributing factors: <bulleted>
fix candidates: <ranked, each with concrete code or config change>
verification next step: <a single command or skill to confirm the fix>
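The evidence-collection step described earlier bounds how much is read before anything is handed to the agent. A rough Python sketch of that bounded read follows. The helper names `tail_jsonl` and `collect_evidence` and the dictionary layout are assumptions made for illustration; only the file names and the 200-entry limit come from the list above.

```python
import json
from collections import deque
from pathlib import Path

MAX_METRIC_ENTRIES = 200  # limit stated in the evidence list
SMALL_FILES = ("config.yaml", "git_sha.txt", "seed.txt", "cuda_info.txt")


def tail_jsonl(path: Path, n: int = MAX_METRIC_ENTRIES) -> list:
    """Parse the last n lines of a .jsonl file, skipping unparseable ones."""
    with path.open() as fh:
        last_lines = deque(fh, maxlen=n)  # keeps only the final n lines
    records = []
    for line in last_lines:
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            # A crashed run can leave a truncated final line; skip it.
            continue
    return records


def collect_evidence(run_dir: Path) -> dict:
    """Gather logs, small metadata files, and recent metrics (hypothetical layout)."""
    bundle = {"logs": {}, "files": {}, "metrics": []}
    for log in sorted(run_dir.glob("*.log")) + sorted(run_dir.glob("stderr.txt")):
        bundle["logs"][log.name] = log.read_text(errors="replace")
    for name in SMALL_FILES:
        candidate = run_dir / name
        if candidate.is_file():  # these files are optional
            bundle["files"][name] = candidate.read_text(errors="replace")
    metrics = run_dir / "metrics.jsonl"
    if metrics.is_file():
        bundle["metrics"] = tail_jsonl(metrics)
    return bundle
```

Tolerating a truncated last line matters here because the runs being diagnosed are, by definition, ones that may have died mid-write.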
Suggest the verification skill. Match the classification to one of:
- NaN or divergence: stage5-loss-spike-rollback (recipe to recover) or stage2-init-loss-check (early sanity).
- OOM: primitive-recompute (activation checkpointing) or batch-size reduction.
- stall: stage2-overfit-single-batch (verify the pipeline works at all) or stage2-grad-flow-viz (find dead layers).
… classification: unclear and ask for more information.
… /tmp/ or to a job scheduler.
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train
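Matching a classification to a verification skill, as described above, amounts to a small lookup over the `classification:` line of the diagnosis. The sketch below is illustrative only. The pairing of each classification with its skills is an assumption inferred from the skill descriptions, and the function names are hypothetical.

```python
VALID_CLASSIFICATIONS = ("NaN", "OOM", "divergence", "stall", "crash", "unclear")

# Assumed pairing, inferred from the skill descriptions rather than confirmed.
VERIFICATION_SKILLS = {
    "NaN": ["stage5-loss-spike-rollback", "stage2-init-loss-check"],
    "divergence": ["stage5-loss-spike-rollback", "stage2-init-loss-check"],
    # Batch-size reduction is a config change, not a skill, so it is not listed.
    "OOM": ["primitive-recompute"],
    "stall": ["stage2-overfit-single-batch", "stage2-grad-flow-viz"],
}


def parse_classification(diagnosis: str) -> str:
    """Extract the classification value; fall back to 'unclear' if absent or unknown."""
    for line in diagnosis.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "classification":
            wanted = value.strip().lower()
            for name in VALID_CLASSIFICATIONS:
                if name.lower() == wanted:
                    return name
    return "unclear"


def suggest_verification(diagnosis: str) -> list:
    """Return candidate verification skills; empty means ask for more information."""
    return VERIFICATION_SKILLS.get(parse_classification(diagnosis), [])
```

An empty result covers both `crash` and `unclear`, for which no verification skill is named in the list.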