From curry-train
Run a short, reproducible benchmark of one optimizer step (forward + backward + optimizer step over N microbatches) using the project's registered runtime. Activate when the user asks to "benchmark a training step", "measure throughput", "time one optimizer step", or "smoke test the runtime". Wraps run_accumulated_step from curry_train.benchmark.
How this skill is triggered — by the user, by Claude, or both
Slash command
/curry-train:benchThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Execute `run_accumulated_step` from `curry_train.benchmark` over a small number of optimizer steps and report wall-clock, tokens/sec, peak memory, and final loss. The intent is **smoke testing**, not throughput tuning — see `stage4-capacity-sweep` and `stage4-optuna-integration` for proper benchmarking.
Execute run_accumulated_step from curry_train.benchmark over a small number of optimizer steps and report wall-clock, tokens/sec, peak memory, and final loss. The intent is smoke testing, not throughput tuning — see stage4-capacity-sweep and stage4-optuna-integration for proper benchmarking.
new-experiment skill, to confirm the freshly scaffolded model produces a non-NaN loss in one step.--model=<name> (optional): a registered model name. Defaults to whatever the project's configs/default.yaml selects.--steps=N (optional, default 5): number of optimizer steps to run.--accum=K (optional, default 1): gradient-accumulation microbatches per optimizer step. Should match GradientAccumulation.steps in the config.Resolve runtime and model. Use curry_train.models.create_runtime(<model>) and runtime.build_model() to get a ModelHandle.
Build the gradient-accumulation primitive.
from curry_train.primitives import GradientAccumulation
accum = GradientAccumulation(steps=K)
Loop N optimizer steps. Each step calls run_accumulated_step(runtime, handle, microbatches, accum, ...) with synthetic or fixture microbatches (the model package should expose a dummy_batch() helper for this purpose).
Measure. Record per-step:
time.perf_counter)AccumulatedStepResult.tokens)torch.cuda.max_memory_allocated() if CUDA available)OptimizerStepResult.grad_norm)Report. Print a small markdown table. Add a single warning line if loss is NaN, grad-norm is 0, or step time variance > 50 % (suggests warmup or first-step compile cost not yet absorbed).
# bench — model=<name>, steps=N, accum=K
| step | loss | grad-norm | tokens | step-time (s) | peak-mem (MiB) |
|------|--------|-----------|--------|---------------|----------------|
| 1 | ? | ? | ? | ? | ? |
| 2 | ? | ? | ? | ? | ? |
| ...
throughput: ? tokens/sec (mean of steps 2..N)
verdict: <ok | NaN-loss | grad-norm-zero | step-time-unstable>
dummy_batch() from the model package or generate torch.randn of the expected shape.stage4-capacity-sweep and Optuna-driven sweeps.curry_train.models.list_models() and ask the user which one.diagnose skill (the user can ask to debug the failure in natural language).stage1-scaffolder for the template.Creates, edits, and optimizes skills for Claude Code, including drafting, evaluating with test prompts, iterating on performance, and improving skill descriptions for better triggering accuracy.
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train