From remote-gpu-train
Run a long training (or any long-running) job on a remote shared-GPU box over SSH so it survives disconnects, and keep the agent in the loop while it runs. Use for launching, watching, and managing training / eval / ablation runs on a remote machine or cluster. First use in a new project needs a one-time adapt step (see below).
Slash command
/remote-gpu-train:remote-gpu-train
Engine: `train.sh` in this skill's directory. Run it bare for usage.
train.sh gpus                       show GPUs, mark which are FREE
train.sh run <gpu> <tag> <args...>  launch on <gpu> in a detached tmux session, logging to a file
train.sh watch <tag>                stream milestones until the run ends (crash or done)
train.sh cp <local> <remote>        copy a file into the repo (remote is repo-relative)
train.sh ssh [cmd...]               anything else: git pull/status, tail a log, tmux ls/kill
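As a rough illustration of how a gpus-style listing could decide which cards are FREE, the sketch below labels rows of nvidia-smi CSV output. The function name mark_free and the thresholds (under 500 MiB used, under 10% utilization) are assumptions made for this example; train.sh may apply a different rule.

```shell
# Sketch only: label each GPU FREE or BUSY from CSV rows read on stdin.
# Expected row shape: "index, memory_used_MiB, utilization_percent".
# Assumed thresholds (not taken from train.sh): 500 MiB and 10%.
mark_free() {
  awk -F', *' -v mem_max="${1:-500}" -v util_max="${2:-10}" '
    NF >= 3 {
      # A card counts as FREE only if both memory and utilization are low.
      state = (($2 + 0 < mem_max) && ($3 + 0 < util_max)) ? "FREE" : "BUSY"
      printf "%s %s\n", $1, state
    }'
}

# On a box with NVIDIA drivers, the rows could come from:
#   nvidia-smi --query-gpu=index,memory.used,utilization.gpu \
#              --format=csv,noheader,nounits | mark_free
```

Both conditions are checked because a card can hold a large allocation while idle between steps, and someone else's job may still own it.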
If the CONFIG block at the top of train.sh is still on its placeholder defaults
(HOST=mybox, REPO=dev/myproject, LAUNCH='python train.py'), STOP and adapt it
first: read ADAPTING.md in this directory and follow it. It tailors the connection,
the launch command, and the "running" log signal to this box and workflow without
breaking the pillars the skill rests on. Once train.sh gpus returns real GPUs and
run points at the real launch command, the skill is ready and you operate it as below.
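The placeholder check described above can be sketched as a small helper. It assumes the CONFIG values appear as plain assignments on their own lines, exactly as quoted in the text; the function name placeholders_left is hypothetical, and the real CONFIG block may be laid out differently.

```shell
# Sketch only: print any CONFIG lines still on the placeholder defaults.
# Exit status is 0 if at least one placeholder remains, 1 if none do.
# Assumes plain assignments such as HOST=mybox, one per line.
placeholders_left() {
  grep -En "^[[:space:]]*(HOST=mybox|REPO=dev/myproject|LAUNCH='python train\.py')[[:space:]]*(#.*)?$" "$1"
}

# Example, run from the skill's directory:
#   if placeholders_left train.sh; then
#     echo "CONFIG is not adapted yet: follow ADAPTING.md first"
#   fi
```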
1. train.sh gpus confirms the box is reachable and shows which cards are free.
2. train.sh run <gpu> <tag> <args> launches on <gpu> in a detached tmux session
   named <tag>, logging to ~/<RUNS>/<tag>.log. The args after <tag> are appended
   to the configured launch command. Pick a FREE GPU and leave capacity for others.
3. train.sh watch <tag> follows the run until it ends. In Claude Code, run it via the
   Monitor tool with persistent: true (a run outlasts Monitor's 1h cap). It prints
   RUNNING once the job is actually stepping and RUN ENDED when the tmux session
   dies. In another harness, drive the same command from that harness's own poll /
   notify loop.
4. To check on a run at any point, read the log tail:
   train.sh ssh 'tail -20 ~/<RUNS>/<tag>.log'.
RUN ENDED only means the session died: a crash and a clean finish look identical, so
confirm from the log tail that the expected outputs were written. A dropped connection
shows "unreachable, retrying", never a false RUN ENDED, since the run is a session on
the remote and keeps going.
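Since RUN ENDED cannot tell a crash from a clean finish, a log check along these lines may help. This is a sketch under stated assumptions: the completion marker "TRAINING COMPLETE" is hypothetical (substitute whatever your job prints last), and the crash patterns only cover a Python traceback and the PyTorch "CUDA out of memory" message. The function name classify_run is also made up for this example.

```shell
# Sketch only: classify an ended run from the last 50 lines of its log.
# $1 = path to a local copy of the log
# $2 = success marker; "TRAINING COMPLETE" is a hypothetical default
classify_run() {
  log=$1
  marker="${2:-TRAINING COMPLETE}"
  if tail -n 50 "$log" | grep -qF "$marker"; then
    echo finished
  elif tail -n 50 "$log" | grep -qE 'Traceback|CUDA out of memory'; then
    echo crashed
  else
    # Neither signal found: read the log by hand before trusting the run.
    echo unknown
  fi
}
```

The log tail can be fetched with the train.sh ssh command shown above and saved to a local file before classifying it. An "unknown" result deliberately does not count as success, so a job that was killed without printing anything still gets a manual look.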
The pillars this skill rests on, and how to tailor it to a different box, workflow, or
harness, are in ADAPTING.md.
npx claudepluginhub maedmatt/remote-gpu-train --plugin remote-gpu-train