Skill

diagnose

Diagnose a training failure or stall by inspecting recent logs, loss curves, OOM traces, NaN events, and config. Activate when the user asks "why did my training crash", "loss went to NaN", "OOM during step X", "training is not improving", or "help me debug this run". Delegates to the failure-diagnoser agent.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/curry-train:diagnose

User invocable

Model invocable

Inline context

Default effort

SKILL.md

55 lines · ~780 tokens

Stats

Language: Python

Stars: 0

Maintenance: Excellent

Last Commit: May 4, 2026

Diagnose a failed or stalled run

Read the artifacts of a recent run, classify the failure mode, and produce a ranked list of likely causes with concrete fixes. Most of the analytical work is delegated to the failure-diagnoser agent.

When to invoke

User asks: "why did this crash", "loss is NaN", "OOM at step X", "loss is flat", "gradients exploded", "training stalled".
After a non-zero-exit training run when the user wants help understanding what happened.

Inputs

$1 (optional): path to a run directory or a single log file. If omitted, look for the most recently modified directory under runs/ or outputs/ in the current project.
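
The fallback lookup described above could be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: the function name `resolve_target` and the `project_root` parameter are hypothetical, and "most recently modified" is assumed to mean filesystem mtime.

```python
from pathlib import Path
from typing import Optional


def resolve_target(arg: Optional[str], project_root: str = ".") -> Optional[Path]:
    """Return the run directory or log file to diagnose, or None if nothing is found.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if arg:
        # An explicit path wins; it may be a run directory or a single log file.
        path = Path(arg)
        return path if path.exists() else None

    root = Path(project_root)
    candidates = [
        child
        for parent in ("runs", "outputs")
        if (root / parent).is_dir()
        for child in (root / parent).iterdir()
        if child.is_dir()
    ]
    if not candidates:
        # Nothing to inspect: the caller should ask the user where the logs are.
        return None
    # Assumption: "most recently modified" is judged by directory mtime.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

Returning `None` instead of raising keeps the "ask the user" branch in the caller, which matches the procedure's instruction to ask rather than guess.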

Procedure

Resolve the target. If the user gave a directory or log file, use it. Otherwise, find the most recently modified runs/<timestamp>/ or outputs/<timestamp>/ directory. If nothing exists, ask the user where their logs are.
Collect the evidence bundle. Read at most:
- The last ~500 lines of any *.log or stderr.txt in the directory.
- config.yaml (the resolved Hydra config).
- The tail of metrics.jsonl (last 200 entries).
- git_sha.txt, seed.txt, cuda_info.txt if present.
- Any core.* dump files or Python traceback files.
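
One way to keep the bundle within the limits above is to tail each file instead of reading it whole. The sketch below is only an illustration under stated assumptions: `collect_evidence` and `tail_lines` are hypothetical names, the bundle layout is invented for the example, and core dumps and separate traceback files are left out for brevity.

```python
import json
from collections import deque
from pathlib import Path
from typing import Dict, List

MAX_LOG_LINES = 500       # "last ~500 lines" of each log
MAX_METRIC_ENTRIES = 200  # "last 200 entries" of metrics.jsonl


def tail_lines(path: Path, limit: int) -> List[str]:
    """Return the last `limit` lines of a text file, streaming through it once."""
    with path.open("r", errors="replace") as handle:
        return [line.rstrip("\n") for line in deque(handle, maxlen=limit)]


def collect_evidence(run_dir: Path) -> Dict[str, object]:
    """Gather a bounded evidence bundle (hypothetical layout) from a run directory."""
    logs: Dict[str, List[str]] = {}
    metrics: List[dict] = []
    text_files: Dict[str, str] = {}

    log_paths = sorted(run_dir.glob("*.log"))
    stderr_path = run_dir / "stderr.txt"
    if stderr_path.is_file():
        log_paths.append(stderr_path)
    for path in log_paths:
        logs[path.name] = tail_lines(path, MAX_LOG_LINES)

    metrics_path = run_dir / "metrics.jsonl"
    if metrics_path.is_file():
        for line in tail_lines(metrics_path, MAX_METRIC_ENTRIES):
            try:
                metrics.append(json.loads(line))
            except json.JSONDecodeError:
                # A run killed mid-write can leave a partial final line; skip it.
                continue

    for name in ("config.yaml", "git_sha.txt", "seed.txt", "cuda_info.txt"):
        path = run_dir / name
        if path.is_file():
            text_files[name] = path.read_text(errors="replace")

    return {"logs": logs, "metrics": metrics, "text_files": text_files}
```

Python's `json.loads` accepts the non-standard `NaN` and `Infinity` tokens by default, so metric lines written after a loss blow-up should still parse.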

Spawn the failure-diagnoser agent. Hand it the evidence bundle and ask for a structured diagnosis (see agents/failure-diagnoser.md). Output schema:

classification: <NaN | OOM | divergence | stall | crash | unclear>
primary cause: <one line>
contributing factors: <bulleted>
fix candidates: <ranked, each with concrete code or config change>
verification next step: <a single command or skill to confirm the fix>
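
For callers that want to check the agent's answer before showing it, the schema above could be mirrored in a small dataclass. This is a sketch under assumptions: the class name `Diagnosis`, the snake_case field names, and the `render` layout are hypothetical and are not taken from agents/failure-diagnoser.md.

```python
from dataclasses import dataclass, field
from typing import List

# Allowed labels, copied from the schema above.
CLASSIFICATIONS = ("NaN", "OOM", "divergence", "stall", "crash", "unclear")


@dataclass
class Diagnosis:
    """Hypothetical container mirroring the output schema."""

    classification: str
    primary_cause: str
    contributing_factors: List[str] = field(default_factory=list)
    fix_candidates: List[str] = field(default_factory=list)  # ranked, best first
    verification_next_step: str = ""

    def __post_init__(self) -> None:
        # Reject labels outside the schema instead of inventing failure modes.
        if self.classification not in CLASSIFICATIONS:
            raise ValueError(f"unknown classification: {self.classification!r}")

    def render(self) -> str:
        """Format the diagnosis in the same field order as the schema."""
        lines = [
            f"classification: {self.classification}",
            f"primary cause: {self.primary_cause}",
            "contributing factors:",
        ]
        lines.extend(f"  - {item}" for item in self.contributing_factors)
        lines.append("fix candidates:")
        lines.extend(
            f"  {rank}. {item}" for rank, item in enumerate(self.fix_candidates, 1)
        )
        lines.append(f"verification next step: {self.verification_next_step}")
        return "\n".join(lines)
```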

Suggest the verification skill. Match the classification to one of:
- NaN / divergence → stage5-loss-spike-rollback (recipe to recover) or stage2-init-loss-check (early sanity).
- OOM → primitive-recompute (activation checkpointing) or batch-size reduction.
- Stall → stage2-overfit-single-batch (verify the pipeline can learn at all) or stage2-grad-flow-viz (find dead layers).
- Unclear → ask the user for more context before spending more compute.
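
The mapping above is essentially a lookup table. A minimal sketch, assuming the skill names are used verbatim as identifiers; `suggest_verification` is a hypothetical helper, and the list has no entry for the crash classification, so the sketch returns nothing for it rather than guessing.

```python
from typing import Dict, List

# Skill names copied from the list above. Batch-size reduction is a config
# change rather than a skill, so it is not listed here.
VERIFICATION_SKILLS: Dict[str, List[str]] = {
    "NaN": ["stage5-loss-spike-rollback", "stage2-init-loss-check"],
    "divergence": ["stage5-loss-spike-rollback", "stage2-init-loss-check"],
    "OOM": ["primitive-recompute"],
    "stall": ["stage2-overfit-single-batch", "stage2-grad-flow-viz"],
}


def suggest_verification(classification: str) -> List[str]:
    """Return candidate verification skills; an empty list means ask the user."""
    return list(VERIFICATION_SKILLS.get(classification, []))
```

An empty result for `unclear` lines up with the instruction to ask for more context before spending more compute.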

Boundaries

Do not modify the user's code automatically. Propose fixes; let the user apply them.
Do not blindly trust a single log line. Cross-check loss / grad-norm / lr / step trajectories before classifying.
Do not invent failure modes. If the agent cannot classify with confidence, output classification: unclear and ask for more information.
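
The cross-check in the second boundary can be made concrete by summarizing each tracked quantity over the metrics tail instead of reading one line. The sketch below assumes each metrics entry is a dict with a `step` field plus numeric fields such as `loss`, `grad_norm`, and `lr`; those key names and both function names are assumptions, not something the skill specifies.

```python
import math
from typing import Dict, List, Optional


def first_nonfinite_step(entries: List[dict], key: str) -> Optional[int]:
    """Return the step of the first NaN/inf value for `key`, or None if all are finite."""
    for entry in entries:
        value = entry.get(key)
        if isinstance(value, (int, float)) and not math.isfinite(value):
            return entry.get("step")
    return None


def summarize(entries: List[dict], key: str) -> Dict[str, Optional[float]]:
    """Summarize one quantity so it can be compared against the others."""
    finite = [
        float(entry[key])
        for entry in entries
        if isinstance(entry.get(key), (int, float)) and math.isfinite(entry[key])
    ]
    return {
        "first": finite[0] if finite else None,
        "last_finite": finite[-1] if finite else None,
        "max_finite": max(finite) if finite else None,
        "first_nonfinite_step": first_nonfinite_step(entries, key),
    }
```

Comparing `summarize(entries, "grad_norm")` with `summarize(entries, "loss")` shows, for example, whether the gradient norm was already climbing before the loss became non-finite, which a single log line cannot show.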

Failure modes

No log bundle found: ask the user where their training stdout/stderr ends up. Many setups redirect output to /tmp/ or to a job scheduler's log files.
Log truncated mid-traceback: tell the user explicitly that the trace is incomplete and which line was the last visible.
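
A truncated Python traceback can often be spotted because the final, unindented exception line is missing. The heuristic below is a rough sketch, not a parser: it assumes log lines carry no timestamp or rank prefix, it only looks at the last traceback in the log, and `traceback_truncated` is a hypothetical name.

```python
from typing import List, Optional, Tuple

TRACEBACK_HEADER = "Traceback (most recent call last):"


def traceback_truncated(lines: List[str]) -> Tuple[bool, Optional[str]]:
    """Return (is_truncated, last_visible_line) for the last traceback in `lines`."""
    last_header = None
    for index, line in enumerate(lines):
        if line.startswith(TRACEBACK_HEADER):
            last_header = index
    if last_header is None:
        return False, None  # No traceback in the log at all.

    tail = [line for line in lines[last_header + 1:] if line.strip()]
    for line in tail:
        # Frame and source lines are indented; the closing
        # "ExceptionType: message" line is not.
        if not line.startswith((" ", "\t")):
            return False, None
    # Only indented frame lines (or nothing) follow the header: report where it stops.
    return True, (tail[-1] if tail else lines[last_header])
```

When the first element is `True`, the second is the line to quote back to the user as the last visible one.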
