infra-fabric-launch | curry-train

Infra · Fabric launch recipe

curryTrain uses Lightning Fabric, not the Lightning Trainer. Fabric is a thin wrapper (about five lines of integration) that provides DDP, FSDP, mixed precision, and device placement while leaving the user in full control of the training loop.

Why Fabric, not Trainer

Subagent B's research is unambiguous: Trainer locks you into its loop abstraction. curryTrain wants methodology recipes (pre-validate, sanity, runs-diff) to drive the loop. The Trainer owns the loop, so it fights that; Fabric leaves the loop in user code, so the recipes plug in naturally.

The minimum integration

import hydra
import lightning as L
import torch
from omegaconf import DictConfig

# build_model, build_loader, warmup_cosine_schedule, loss_fn, and Run are
# assumed to be defined elsewhere in the project.

@hydra.main(version_base=None, config_path="configs", config_name="config")
def main(cfg: DictConfig):
    # 1. Fabric setup
    fabric = L.Fabric(
        accelerator="gpu",
        devices=cfg.parallelism.devices,
        strategy=cfg.parallelism.strategy,    # "ddp", "fsdp", "deepspeed_stage_3"
        precision=cfg.training.precision,     # "bf16-mixed", "16-mixed", "32-true"
    )
    fabric.launch()

    # 2. Build model + optimizer (still raw PyTorch)
    model = build_model(cfg.model)
    optimizer = torch.optim.AdamW(model.parameters(), lr=cfg.training.lr)
    scheduler = warmup_cosine_schedule(optimizer, ...)

    # 3. One Fabric call wraps both
    model, optimizer = fabric.setup(model, optimizer)
    train_loader = fabric.setup_dataloaders(build_loader(cfg.data))

    # 4. Raw training loop
    with Run(cfg) as run:
        for step, batch in enumerate(train_loader):   # step is needed for logging
            optimizer.zero_grad()
            loss = loss_fn(model(batch.x), batch.y)
            fabric.backward(loss)             # replaces loss.backward()
            optimizer.step()
            scheduler.step()
            run.log_metric(step=step, loss=loss.item(), ...)

if __name__ == "__main__":
    main()

The full diff vs. plain PyTorch:

fabric.launch() instead of manual torch.distributed.init_process_group.
fabric.setup(model, optimizer) instead of manual DDP wrapping.
fabric.setup_dataloaders(loader) instead of a manual DistributedSampler and per-batch .to(device) calls.
fabric.backward(loss) instead of loss.backward().
fabric.print(...), fabric.save(...), fabric.load(...) for rank-aware utilities.
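To show what the rank-aware utilities replace, here is a minimal sketch of a hand-rolled rank-zero print. It assumes the process was started by torchrun, which exports the RANK environment variable. The names global_rank and rank_zero_print are hypothetical, not Fabric APIs; with Fabric, fabric.print(...) makes this helper unnecessary.

```python
import os


def global_rank() -> int:
    """Global rank from the RANK variable exported by torchrun; 0 if unset."""
    return int(os.environ.get("RANK", "0"))


def rank_zero_print(*args, **kwargs) -> bool:
    """Print only on global rank 0 and report whether anything was printed.

    Hypothetical helper for illustration; with Fabric, call fabric.print(...).
    """
    if global_rank() != 0:
        return False
    print(*args, **kwargs)
    return True
```

Without a guard like this, an 8-GPU run prints every log line eight times.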

Strategy mapping

`cfg.parallelism.strategy`	What it gives
`"auto"`	Sensible default for the visible hardware.
`"ddp"`	DistributedDataParallel; smallest overhead, no sharding.
`"ddp_find_unused_parameters_true"`	DDP allowing unused params (avoid in production).
`"fsdp"`	FSDP with default sharding; pass an FSDPStrategy instance to configure it.
`"deepspeed_stage_3"`	DeepSpeed ZeRO-3 (heavier; only when FSDP is insufficient).

For full control of FSDP options:

import lightning as L
from lightning.fabric.strategies import FSDPStrategy
from torch.distributed.fsdp import BackwardPrefetch, MixedPrecision

strategy = FSDPStrategy(
    auto_wrap_policy=...,
    mixed_precision=MixedPrecision(...),
    activation_checkpointing_policy=...,
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,  # FSDP expects the enum member, not a string
)
fabric = L.Fabric(strategy=strategy, ...)

Launching

curryTrain does not ship a custom launcher. Use torchrun:

torchrun --nproc_per_node=8 train.py experiment.name=...

Multi-node (run the same command on every node):

torchrun --nnodes=2 --nproc_per_node=8 \
  --rdzv-id=run42 --rdzv-backend=c10d --rdzv-endpoint=$MASTER:29500 \
  train.py experiment.name=...
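If the launch command has to be generated from a script (for example inside a job template), the two invocations above can be assembled as follows. This is a sketch only: torchrun_argv is a hypothetical helper, not part of curryTrain, and the override experiment.name=demo is a placeholder value.

```python
import shlex
from typing import List, Optional


def torchrun_argv(
    script: str,
    overrides: List[str],
    nproc_per_node: int,
    nnodes: int = 1,
    rdzv_id: Optional[str] = None,
    rdzv_endpoint: Optional[str] = None,
) -> List[str]:
    """Build a torchrun argv list (hypothetical helper for illustration)."""
    argv = ["torchrun"]
    if nnodes > 1:
        # Multi-node launches use the c10d rendezvous flags shown above.
        if not (rdzv_id and rdzv_endpoint):
            raise ValueError("multi-node launches need rdzv_id and rdzv_endpoint")
        argv += [
            f"--nnodes={nnodes}",
            f"--nproc_per_node={nproc_per_node}",
            f"--rdzv-id={rdzv_id}",
            "--rdzv-backend=c10d",
            f"--rdzv-endpoint={rdzv_endpoint}",
        ]
    else:
        argv.append(f"--nproc_per_node={nproc_per_node}")
    # Hydra overrides go after the script name, exactly as on the command line.
    return argv + [script, *overrides]


if __name__ == "__main__":
    cmd = torchrun_argv("train.py", ["experiment.name=demo"], nproc_per_node=8)
    print(shlex.join(cmd))
```

Returning a list instead of a string keeps the result safe to pass to subprocess.run without shell quoting issues.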

Procedure when assisting a user

Confirm Lightning is installed (pip install lightning). The Fabric class is at lightning.Fabric.
Convert their existing single-process PyTorch training script to Fabric by applying the four core changes listed above (fabric.launch, fabric.setup, fabric.setup_dataloaders, fabric.backward). Do not refactor toward LightningModule.
Confirm DDP works first (strategy="ddp"), then escalate to FSDP only if memory needs it.
For mixed precision, default to "bf16-mixed" on Ampere/Hopper. fp16 ("16-mixed") depends on loss scaling and is prone to overflow; use it only on older GPUs without bf16 support.
After conversion, invoke the bench skill (e.g. ask Claude to "smoke-test the runtime for 5 steps") to confirm everything works end-to-end on the chosen strategy.
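The "only if memory needs it" decision in the DDP-then-FSDP step can be approximated with simple arithmetic before launching anything. The sketch below assumes AdamW with fp32 weights, fp32 gradients, and two fp32 moment buffers (16 bytes per parameter), which is the replicated state each rank holds under DDP. It ignores activations entirely, and both the state_budget fraction and the function names are hypothetical choices for illustration.

```python
# fp32 weights (4) + fp32 grads (4) + two fp32 Adam moments (8), in bytes.
BYTES_PER_PARAM_ADAMW_FP32 = 4 + 4 + 8


def replicated_state_gib(num_params: int) -> float:
    """Model + optimizer state held by every rank under DDP, in GiB."""
    return num_params * BYTES_PER_PARAM_ADAMW_FP32 / 2**30


def suggest_strategy(num_params: int, gpu_mem_gib: float, state_budget: float = 0.5) -> str:
    """Suggest "ddp" if the replicated state fits in a fraction of GPU memory.

    state_budget is an assumed fraction reserved for weights and optimizer
    state; the remainder is left for activations and workspace.
    """
    if not 0 < state_budget <= 1:
        raise ValueError("state_budget must be in (0, 1]")
    fits = replicated_state_gib(num_params) <= gpu_mem_gib * state_budget
    return "ddp" if fits else "fsdp"
```

For example, suggest_strategy(7_000_000_000, gpu_mem_gib=80) suggests "fsdp", because roughly 104 GiB of replicated state cannot fit on one 80 GiB device. Treat the result as a starting point and confirm with the smoke test.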

Boundaries

Fabric does not abstract training loop logic — that's intentional. Loop-level features (callbacks, hooks) are the user's job.
Fabric does not provide MoE-aware all-to-all out of the box. For MoE, additional primitive code is needed (primitive-experts).
Fabric does not replace TP / PP. For Megatron-style TP/PP, you compose Fabric (handles DDP/FSDP) with the parallelism primitives (see Related).

Common mistakes

Mixing Lightning Trainer and Fabric → pick one. We pick Fabric.
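Using loss.backward() instead of fabric.backward(loss) → a silent bug: gradient scaling is skipped under "16-mixed" and strategy-specific backward logic (e.g. DeepSpeed) is bypassed.
Wrong precision string → "32" is a legacy alias for "32-true" (full fp32), not a mixed-precision mode; "bf16-mixed" is the modern default.
Forgetting fabric.setup_dataloaders(...) → no DistributedSampler, so every rank iterates the same full dataset, and batches are not moved to the device automatically.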
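The dataloader mistake is easier to see with concrete indices. The sketch below is a simplified pure-Python model of how a DistributedSampler with shuffle=False and drop_last=False is understood to partition a dataset: pad the index list by wrapping around so it divides evenly, then give each rank a strided slice. shard_indices is a hypothetical name; the real sampler lives in torch.utils.data.

```python
import math
from typing import List


def shard_indices(dataset_len: int, world_size: int, rank: int) -> List[int]:
    """Indices one rank would see with a non-shuffled distributed sampler."""
    if dataset_len < 1 or world_size < 1 or not 0 <= rank < world_size:
        raise ValueError("need dataset_len >= 1 and 0 <= rank < world_size")
    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size
    # Pad by wrapping around so every rank gets the same number of samples.
    padded = [i % dataset_len for i in range(total)]
    # Each rank takes every world_size-th index, starting at its own rank.
    return padded[rank:total:world_size]


def unsharded_indices(dataset_len: int) -> List[int]:
    """What every rank sees when no distributed sampler is installed."""
    return list(range(dataset_len))
```

With 10 samples on 4 ranks, shard_indices gives each rank three distinct samples (a few indices repeat as padding), while unsharded_indices gives every rank all ten. In that case each optimizer step averages gradients over duplicated data and an epoch costs world_size times more compute than intended.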

Related

skills/infra-hydra-config — the parallelism config group.
skills/primitive-distributed-optimizer — how Fabric's FSDP relates to the primitive.
skills/stage4-parallel-primitive-intro — when to escalate strategy.
Lightning Fabric docs.