From curry-train
Lightning Fabric integration recipe: a minimal 5-line setup that gives DDP / FSDP / mixed precision while keeping a raw PyTorch training loop. Activate when the user mentions "Lightning Fabric", "torchrun", "DDP setup", "FSDP setup", or "mixed precision", or is wiring up the launch script.
Slash command
/curry-train:infra-fabric-launch
curryTrain uses **Lightning Fabric**, not the Lightning Trainer. Fabric is a minimal `Fabric` class (~5 lines of integration) that gives DDP / FSDP / mixed precision / device placement, while leaving the user in full control of the training loop.
Subagent B's research is unambiguous: Trainer locks you into its loop abstraction. curryTrain wants methodology recipes (pre-validate, sanity, runs-diff) to drive the loop. The Trainer fights that. Fabric supports the recipes naturally.
```python
import hydra
import lightning as L
import torch
from omegaconf import DictConfig

# build_model, build_loader, warmup_cosine_schedule, loss_fn and Run are
# project-level helpers, assumed to be importable in this script.


@hydra.main(version_base=None, config_path="configs", config_name="config")
def main(cfg: DictConfig):
    # 1. Fabric setup
    fabric = L.Fabric(
        accelerator="gpu",
        devices=cfg.parallelism.devices,
        strategy=cfg.parallelism.strategy,  # "ddp", "fsdp", "deepspeed_stage_3"
        precision=cfg.training.precision,   # "bf16-mixed", "16-mixed", "32"
    )
    fabric.launch()

    # 2. Build model + optimizer (still raw PyTorch)
    model = build_model(cfg.model)
    optimizer = torch.optim.AdamW(model.parameters(), lr=cfg.training.lr)
    scheduler = warmup_cosine_schedule(optimizer, ...)

    # 3. One Fabric call wraps both
    model, optimizer = fabric.setup(model, optimizer)
    train_loader = fabric.setup_dataloaders(build_loader(cfg.data))

    # 4. Raw training loop (enumerate supplies the step counter used for logging)
    with Run(cfg) as run:
        for step, batch in enumerate(train_loader):
            optimizer.zero_grad()
            loss = loss_fn(model(batch.x), batch.y)
            fabric.backward(loss)  # replaces loss.backward()
            optimizer.step()
            scheduler.step()
            run.log_metric(step=step, loss=loss.item(), ...)


if __name__ == "__main__":
    main()
```
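The loop above calls a `warmup_cosine_schedule` helper whose arguments are elided in the source. As a rough illustration only, and not curryTrain's actual implementation, the sketch below shows one common shape for such a schedule: a linear warmup followed by cosine decay, expressed as a multiplier on the base learning rate. The function name `warmup_cosine_factor` and the parameters `warmup_steps`, `total_steps` and `min_factor` are hypothetical.

```python
import math


def warmup_cosine_factor(step: int, warmup_steps: int, total_steps: int,
                         min_factor: float = 0.0) -> float:
    """Multiplier for the base learning rate at a given optimizer step.

    Hypothetical helper: the names and defaults are assumptions, not part
    of curryTrain or Lightning.
    """
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp that reaches 1.0 on the last warmup step.
        return (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    # Decay from 1.0 down to min_factor along a half cosine.
    return min_factor + (1.0 - min_factor) * cosine
```

A function like this can be handed to `torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=...)`, which calls it with the current step each time `scheduler.step()` runs.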
The full diff vs. plain PyTorch:
- `fabric.launch()` instead of manual `torch.distributed.init_process_group`.
- `fabric.setup(model, optimizer)` instead of manual DDP wrapping.
- `fabric.setup_dataloaders(loader)` instead of manual `DistributedSampler`.
- `fabric.backward(loss)` instead of `loss.backward()`.
- `fabric.print(...)`, `fabric.save(...)`, `fabric.load(...)` for rank-aware utilities.

| `cfg.parallelism.strategy` | What it gives |
|---|---|
| `"auto"` | Sensible default for the visible hardware. |
| `"ddp"` | DistributedDataParallel; smallest overhead, no sharding. |
| `"ddp_find_unused_parameters_true"` | DDP allowing unused params (avoid in production). |
| `"fsdp"` | FSDP with default sharding; can be configured via dataclass. |
| `"deepspeed_stage_3"` | DeepSpeed ZeRO-3 (heavier; only when FSDP is insufficient). |
For full control of FSDP options:
```python
import lightning as L
from lightning.fabric.strategies import FSDPStrategy
from torch.distributed.fsdp import BackwardPrefetch, MixedPrecision

strategy = FSDPStrategy(
    auto_wrap_policy=...,
    mixed_precision=MixedPrecision(...),
    activation_checkpointing_policy=...,
    # Forwarded to torch's FSDP, which expects the enum rather than a string.
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
)
fabric = L.Fabric(strategy=strategy, ...)
```
curryTrain does not ship a custom launcher. Use torchrun:
```bash
torchrun --nproc_per_node=8 train.py experiment.name=...
```

Multi-node (run the same command on every node):

```bash
torchrun --nnodes=2 --nproc_per_node=8 \
    --rdzv-id=run42 --rdzv-backend=c10d --rdzv-endpoint=$MASTER:29500 \
    train.py experiment.name=...
```
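torchrun exports `RANK`, `LOCAL_RANK`, `WORLD_SIZE` and `LOCAL_WORLD_SIZE` to every worker it starts. The sketch below shows one way a training script could read them and fail early when the configured device count disagrees with `--nproc_per_node`. The helper names `read_torchrun_env` and `check_devices` are hypothetical, and the check assumes `cfg.parallelism.devices` is a plain integer.

```python
import os
from typing import Mapping, NamedTuple


class TorchrunEnv(NamedTuple):
    rank: int              # global rank across all nodes
    local_rank: int        # rank within this node
    world_size: int        # total number of worker processes
    local_world_size: int  # workers on this node (--nproc_per_node)


def read_torchrun_env(environ: Mapping[str, str] = os.environ) -> TorchrunEnv:
    """Parse the rank variables torchrun sets for each worker (hypothetical helper)."""
    missing = [k for k in ("RANK", "LOCAL_RANK", "WORLD_SIZE") if k not in environ]
    if missing:
        raise RuntimeError(f"not launched by torchrun? missing: {', '.join(missing)}")
    world_size = int(environ["WORLD_SIZE"])
    return TorchrunEnv(
        rank=int(environ["RANK"]),
        local_rank=int(environ["LOCAL_RANK"]),
        world_size=world_size,
        # Fall back to world_size when LOCAL_WORLD_SIZE is absent.
        local_world_size=int(environ.get("LOCAL_WORLD_SIZE", world_size)),
    )


def check_devices(cfg_devices: int, env: TorchrunEnv) -> None:
    """Raise if the configured device count differs from the launcher's per-node count."""
    if cfg_devices != env.local_world_size:
        raise ValueError(
            f"devices={cfg_devices} but torchrun started "
            f"{env.local_world_size} processes on this node"
        )
```

For the two-node command above, each worker would see `WORLD_SIZE=16` and `LOCAL_WORLD_SIZE=8`, so a config with `devices=8` passes the check.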
1. Confirm Lightning is installed (`pip install lightning`). The `Fabric` class is at `lightning.Fabric`.
2. Convert the user's existing single-process PyTorch training script to Fabric using the 4-line diff above. Do not refactor toward `LightningModule`.
3. Confirm DDP works first (`strategy="ddp"`), then escalate to FSDP only if the model does not fit in memory.
4. For mixed precision, default to `"bf16-mixed"` on Ampere/Hopper. fp16 is brittle; use `"16-mixed"` only on older GPUs that lack bf16 support.
5. After conversion, invoke the bench skill (e.g. ask Claude to "smoke-test the runtime for 5 steps") to confirm everything works end-to-end on the chosen strategy.
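Whether memory "needs" FSDP can be estimated before running anything. The sketch below uses a common rule of thumb for AdamW with fp32 master weights: about 16 bytes per parameter (4 for weights, 4 for gradients, 8 for the two Adam moments), not counting activations. The function names and the 50% headroom reserved for activations are assumptions for illustration; an actual DDP run remains the real test.

```python
def training_state_gib(num_params: int, bytes_per_param: int = 16) -> float:
    """Approximate per-GPU memory for weights, grads and AdamW state under DDP.

    16 bytes/param is a rule of thumb (fp32 weights + fp32 grads + two fp32
    Adam moments); activations and buffers are not included.
    """
    return num_params * bytes_per_param / 2**30


def suggest_strategy(num_params: int, gpu_mem_gib: float, headroom: float = 0.5) -> str:
    """Return "ddp" if the replicated training state fits the budget, else "fsdp".

    headroom is the fraction of GPU memory allowed for training state; the
    0.5 default is an arbitrary assumption that leaves room for activations.
    """
    budget_gib = gpu_mem_gib * headroom
    return "ddp" if training_state_gib(num_params) <= budget_gib else "fsdp"
```

Under these assumptions, a 1B-parameter model needs roughly 15 GiB of training state per GPU and stays on DDP with 80 GiB cards, while a 7B-parameter model needs over 100 GiB and has to be sharded.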
primitive-experts).
- `loss.backward()` instead of `fabric.backward(loss)` → silent bug under mixed precision and certain strategies.
- Wrong `precision` (e.g. `"32"` where mixed precision was intended) → `"bf16-mixed"` is the modern default; `"32"` is full fp32, the same as `"32-true"`.
- Skipping `fabric.setup_dataloaders(...)` → no `DistributedSampler`, ranks see overlapping data.

- `skills/infra-hydra-config`: the `parallelism` config group.
- `skills/primitive-distributed-optimizer`: how Fabric's FSDP relates to the primitive.
- `skills/stage4-parallel-primitive-intro`: when to escalate strategy.
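To make the overlapping-data pitfall concrete, here is a simplified sketch of the strided split that a distributed sampler performs. It deliberately omits the shuffling and padding that `torch.utils.data.DistributedSampler` also does, and `shard_indices` is a hypothetical name used only for illustration.

```python
from typing import List


def shard_indices(num_samples: int, rank: int, world_size: int) -> List[int]:
    """Indices one rank would read: rank, rank + world_size, rank + 2 * world_size, ...

    Simplified illustration only: no shuffling, no padding to equal length.
    """
    return list(range(rank, num_samples, world_size))


# Without a sampler, every rank iterates the full dataset: complete overlap.
unsharded = [list(range(10)) for _ in range(4)]

# With the strided split, the four ranks read disjoint slices that together
# cover the dataset exactly once.
sharded = [shard_indices(10, rank, 4) for rank in range(4)]
```

With 10 samples and 4 ranks, rank 0 reads indices 0, 4, 8 and rank 3 reads 3, 7. The real sampler would pad the shorter shards so that all ranks run the same number of steps.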
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train