Skill

remote-gpu-train

Run a long training (or any long-running) job on a remote shared-GPU box over SSH so it survives disconnects, and keep the agent in the loop while it runs. Use for launching, watching, and managing training / eval / ablation runs on a remote machine or cluster. First use in a new project needs a one-time adapt step (see below).

Supporting Files

ADAPTING.md, train.sh

SKILL.md

48 lines · ~671 tokens

Stats

Language: Shell

Parent stars: 2

Maintenance: Good

Last commit: Jun 17, 2026

Remote GPU training

Engine: train.sh in this skill's directory. Run it bare for usage.

train.sh gpus                       show GPUs, mark which are FREE
train.sh run <gpu> <tag> <args...>  launch on <gpu> in a detached tmux session, logging to a file
train.sh watch <tag>                stream milestones until the run ends (crash or done)
train.sh cp <local> <remote>        copy a file into the repo (remote is repo-relative)
train.sh ssh [cmd...]               anything else: git pull/status, tail a log, tmux ls/kill
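
As an illustration of how the subcommands above chain together, here is a minimal sketch of one launch cycle from a plain shell. The `launch_cycle` helper, the `TRAIN_SH` variable, and the example tag and flags are hypothetical and not part of the skill; only the `gpus`, `run`, and `watch` subcommands come from the usage summary.

```shell
# Hypothetical wrapper around the documented subcommands.
# TRAIN_SH is an assumed variable pointing at the skill's train.sh.
TRAIN_SH=${TRAIN_SH:-./train.sh}

launch_cycle() {
  gpu=$1
  tag=$2
  shift 2
  # Check the box is reachable and see which GPUs are FREE.
  "$TRAIN_SH" gpus || return 1
  # Launch in a detached tmux session; remaining args go to the launch command.
  "$TRAIN_SH" run "$gpu" "$tag" "$@" || return 1
  # Stream milestones until the run ends.
  "$TRAIN_SH" watch "$tag"
}
```

For example, `launch_cycle 0 ablation-a --lr 3e-4` would launch on GPU 0 under the tag `ablation-a`, where both the tag and the `--lr` flag are made up for illustration. The sketch does not pick a GPU for you: read the `gpus` output and choose a FREE card yourself.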

First, adapt it to this setup (once)

If the CONFIG block at the top of train.sh is still on its placeholder defaults (HOST=mybox, REPO=dev/myproject, LAUNCH='python train.py'), STOP and adapt it first: read ADAPTING.md in this directory and follow it. It tailors the connection, the launch command, and the "running" log signal to this box and workflow without breaking the pillars the skill rests on. Once train.sh gpus returns real GPUs and run points at the real launch command, the skill is ready and you operate it as below.
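
The placeholder check described above could be automated along these lines. This is a sketch under an assumption: that the CONFIG block uses plain `VAR=value` assignments at the start of a line. The `config_is_placeholder` helper is hypothetical, and the real layout of train.sh may differ.

```shell
# Hypothetical helper: succeeds (exit 0) if the script still carries any of the
# placeholder defaults named in this document, fails (exit 1) otherwise.
# Assumes CONFIG entries are simple VAR=value lines, optionally quoted.
config_is_placeholder() {
  script=$1
  grep -Eq "^[[:space:]]*(HOST=[\"']?mybox|REPO=[\"']?dev/myproject|LAUNCH=[\"']?python train\.py)" "$script"
}
```

A caller might run `config_is_placeholder ./train.sh && echo "adapt first: see ADAPTING.md"`. The match is a prefix match, so a real host that happens to start with `mybox` would also trigger it.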

Workflow

train.sh gpus confirms the box is reachable and shows which cards are free.
train.sh run <gpu> <tag> <args> launches on <gpu> in a detached tmux session named <tag>, logging to ~/<RUNS>/<tag>.log. The args after <tag> are appended to the configured launch command. Pick a FREE GPU and leave capacity for others.
train.sh watch <tag> follows the run until it ends. In Claude Code, run it via the Monitor tool with persistent: true (a run outlasts Monitor's 1h cap). It prints RUNNING once the job is actually stepping and RUN ENDED when the tmux session dies. In another harness, drive the same command from that harness's own poll / notify loop.
The dashboard URL (wandb, etc.), if any, is printed early in the log, so check for it shortly after launch: train.sh ssh 'tail -20 ~/<RUNS>/<tag>.log'.
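
For a harness without a Monitor tool, the poll / notify loop mentioned in the watch step might be sketched as a filter over the watch output. The `notify_on_milestones` helper and the `event:` lines are hypothetical. The sketch also assumes the milestone strings named in this document appear somewhere within the lines that watch prints, which is not confirmed.

```shell
# Hypothetical filter: reads watch output on stdin and prints one event per
# milestone. Returns 0 once RUN ENDED is seen, 1 if the stream ends without it.
notify_on_milestones() {
  while IFS= read -r line; do
    case $line in
      *"RUN ENDED"*)
        echo "event: ended"
        return 0
        ;;
      *RUNNING*)
        echo "event: running"
        ;;
      *"unreachable, retrying"*)
        echo "event: unreachable"
        ;;
    esac
  done
  return 1
}
```

It would be driven as `train.sh watch <tag> | notify_on_milestones`, with the `echo` calls swapped for whatever notification mechanism the harness provides.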

RUN ENDED only means the tmux session died: a crash and a clean finish look identical, so confirm from the log tail that the expected outputs were written. A dropped connection shows up as "unreachable, retrying" and never as a false RUN ENDED, because the run lives in a session on the remote and keeps going while the connection is down.
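
Since RUN ENDED alone cannot tell a crash from a clean finish, the log-tail check could be sketched as below. The `classify_tail` helper and both patterns are assumptions: `DONE_PATTERN` stands in for whatever your job logs on success, and `CRASH_PATTERN` lists a few common failure strings, not an exhaustive set.

```shell
# Hypothetical patterns; replace them with what your own job actually logs.
DONE_PATTERN=${DONE_PATTERN:-'saved final checkpoint'}
CRASH_PATTERN=${CRASH_PATTERN:-'Traceback|CUDA out of memory|Killed'}

# Read a log tail on stdin and print "crash", "clean", or "unknown".
# Crash markers win over the done marker if both appear.
classify_tail() {
  tail_text=$(cat)
  if printf '%s\n' "$tail_text" | grep -Eq "$CRASH_PATTERN"; then
    echo crash
  elif printf '%s\n' "$tail_text" | grep -Fq "$DONE_PATTERN"; then
    echo clean
  else
    echo unknown
  fi
}
```

After a RUN ENDED, one could pipe the tail through it: `train.sh ssh 'tail -50 ~/<RUNS>/<tag>.log' | classify_tail`. An `unknown` result means neither marker was found, so read the tail by hand.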

The pillars this skill rests on, and how to tailor it to a different box, workflow, or harness, are in ADAPTING.md.
