Skill

managing-jobs

From hpc

Submit, monitor, cancel, array, and chain Slurm jobs on the Yale SOM HPC cluster. TRIGGER when writing sbatch scripts for the Yale SOM HPC cluster, choosing cluster partitions/resources, using job arrays or dependencies on the cluster, or running sacct/squeue/scancel against cluster jobs.

SKILL.md

140 lines · ~1.3k tokens

Stats

Parent stars: 1

Maintenance: Excellent

Last commit: Apr 29, 2026

Managing Jobs

Rule: test small, request explicitly, monitor results, then scale.

Basic commands

sbatch job.sh                    # submit batch job
squeue -u $USER                  # current jobs
sacct -j 12345                   # completed job accounting
scancel 12345                    # cancel your job
scontrol show job 12345          # detailed job state
sinfo -s                         # partition summary

Safe Slurm template

#!/bin/bash
#SBATCH --job-name=analysis
#SBATCH --partition=default_queue
#SBATCH --time=00:30:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
# Slurm does not create the logs/ directory; run mkdir -p logs before submitting.
#SBATCH --output=logs/%x_%j.out

set -euo pipefail

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
export MKL_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
export OPENBLAS_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}

cd /gpfs/project/myproject/code
uv run python src/analysis.py

Interactive session

Use only for debugging, never for long unattended work. Interactive jobs hold resources whether or not you are typing — a forgotten srun --pty bash can sit on a GPU all weekend.

srun --partition=cpunormal --cpus-per-task=2 --mem=8G --time=01:00:00 --pty bash

Rules:

Request the smallest allocation that lets you debug. Two CPUs and 8 GB is usually enough.
Keep --time in hours, not days. Re-request if you need more.
Exit (exit or Ctrl-D) the moment you are done. Do not minimize the terminal and walk away.
Anything that can run unattended belongs in sbatch, not srun --pty.

Job arrays

Use arrays for independent tasks. Throttle concurrency with %N.

#!/bin/bash
#SBATCH --job-name=array-example
#SBATCH --partition=default_queue
#SBATCH --array=1-500%50
#SBATCH --time=00:30:00
#SBATCH --cpus-per-task=1
#SBATCH --mem=2G
#SBATCH --output=logs/%x_%A_%a.out

set -euo pipefail

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
export MKL_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
export OPENBLAS_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}

uv run python src/run_task.py --task-id "${SLURM_ARRAY_TASK_ID}"

Do not submit 500 tasks all at once unless you mean to occupy the cluster. Use %50 or smaller.

Politeness rule: leave room at the table. If your array can run at 50-way concurrency instead of 500-way, choose 50. Other people are using the same nodes, GPUs, GPFS metadata servers, and queue. Throttling your own work usually also makes debugging easier.
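One common way to give each array task its own inputs is a plain-text parameter file with one line per task. The sketch below is only an illustration: the helper name param_for_task, the file params.txt, and the --params flag are hypothetical and do not appear in this skill.

```shell
# Hypothetical helper: print line N of a parameter file, where N is the
# 1-based array task ID. Fails if that line is missing or empty.
param_for_task() {
    local file=$1 task_id=$2
    local line
    line=$(sed -n "${task_id}p" "$file")
    if [[ -z "$line" ]]; then
        echo "no parameters for task ${task_id} in ${file}" >&2
        return 1
    fi
    printf '%s\n' "$line"
}

# Inside the array script it could be used like this (params.txt and the
# --params flag are assumptions about your own code, not part of the skill):
#   params=$(param_for_task params.txt "${SLURM_ARRAY_TASK_ID}")
#   uv run python src/run_task.py --task-id "${SLURM_ARRAY_TASK_ID}" --params "$params"
```

If an array turns out to be too aggressive after submission, its throttle can be lowered in place with scontrol update JobId=JOBID ArrayTaskThrottle=N rather than cancelling and resubmitting.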

Dependencies

prep=$(sbatch --parsable slurm/01_prepare.sh)
est=$(sbatch --parsable --dependency=afterok:${prep} slurm/02_estimate.sh)
sbatch --dependency=afterok:${est} slurm/03_tables.sh

Use afterany for cleanup or restart logic that should run even after failure.
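As a rough sketch of the afterany pattern, the function below chains two stages with afterok and then attaches a cleanup job with afterany. The function name submit_with_cleanup and the idea of a separate cleanup script are assumptions for illustration, not part of this skill.

```shell
# Hypothetical helper: submit prepare -> estimate -> cleanup.
# Prints the job ID of the cleanup job.
submit_with_cleanup() {
    local prepare_script=$1 estimate_script=$2 cleanup_script=$3
    local prep est

    prep=$(sbatch --parsable "$prepare_script") || return 1
    # --parsable may print "jobid;cluster" on multi-cluster sites; keep the ID.
    prep=${prep%%;*}

    # afterok: start only if the prepare job exits with code 0.
    est=$(sbatch --parsable --dependency=afterok:"${prep}" "$estimate_script") || return 1
    est=${est%%;*}

    # afterany: start once the estimate job has ended, in any state.
    sbatch --parsable --dependency=afterany:"${est}" "$cleanup_script"
}
```

One caveat worth checking on your own cluster: if the prepare job fails, the estimate job can never satisfy afterok and may sit pending with reason DependencyNeverSatisfied until it is cancelled, in which case the cleanup job waits too.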

Time limits

Shorter jobs often schedule faster because Slurm can backfill them into idle slots between bigger jobs. Multi-day jobs rarely fit those slots, so they tend to wait longest. If work is resumable, prefer 2–4 hour chunks: when each step skips outputs that already exist, a killed job picks up where it left off on resubmit.
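A skip-if-exists step can be sketched in a few lines of bash. The helper name run_if_missing and the .partial suffix are hypothetical. The sketch assumes the wrapped command takes its output path as its last argument.

```shell
# Hypothetical helper: run a command only if its output file is missing or
# empty. The command is given a temporary path as its last argument, and the
# result is renamed into place only on success, so a job killed mid-write does
# not leave a half-finished file that a resubmit would treat as done.
run_if_missing() {
    local output=$1
    shift
    if [[ -s "$output" ]]; then
        echo "skip: ${output} already exists"
        return 0
    fi
    local tmp="${output}.partial"
    "$@" "$tmp" && mv "$tmp" "$output"
}

# Example use inside a job script (paths and script name are placeholders):
#   run_if_missing results/chunk_01.csv uv run python src/estimate.py --chunk 1 --out
```

Writing to a temporary name first is a design choice, not a requirement; it just keeps a timed-out job from being mistaken for a finished one.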

Right-size before submitting

Do not pad requests "just in case." Over-requesting blocks scheduling for everyone, and the cluster is moving toward enforced per-user caps. The right-sizing loop:

Submit a 10-minute test job with a small input.
Run seff JOBID after it finishes.
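Set the real job's --mem to ~1.5–2× the test's peak memory (MaxRSS in sacct, "Memory Utilized" in seff), not 10×.
Set --cpus-per-task to what your code actually parallelizes over, and pass SLURM_CPUS_PER_TASK to whatever sets the thread count: the BLAS thread variables, Python multiprocessing, data.table's setDTthreads in R, or Stata's set processors.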
Set --time from a sample-data extrapolation, not from "what if it takes a week."

See self-diagnosing resource use for the post-job checks that drive this loop.
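The memory step of the loop above is simple arithmetic, sketched here for the 1.5× end of the range. The helper name recommend_mem_mb is hypothetical, and the sketch assumes MaxRSS is a whole number with a K, M, or G suffix (sacct's default output looks like 2097152K); decimal values would need extra handling.

```shell
# Hypothetical helper: convert a MaxRSS value such as "2097152K", "2048M", or
# "2G" into a --mem request in MB with 1.5x headroom, rounded up to a whole MB.
recommend_mem_mb() {
    local maxrss=$1
    local number=${maxrss%[KMGkmg]}
    local unit=${maxrss#"$number"}
    local kb
    case "$unit" in
        K|k) kb=$number ;;
        M|m) kb=$(( number * 1024 )) ;;
        G|g) kb=$(( number * 1024 * 1024 )) ;;
        *)   echo "unrecognized MaxRSS value: ${maxrss}" >&2; return 1 ;;
    esac
    # kb * 1.5 / 1024 in integer arithmetic, rounded up.
    echo $(( (kb * 3 + 2047) / 2048 ))
}
```

For a test job that peaked at 2097152K (2 GB), recommend_mem_mb 2097152K prints 3072, which would translate to --mem=3072M on the real job.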

Before scaling up

sbatch slurm/test.sh
squeue -u $USER
sacct -j JOBID --format=JobID,Elapsed,MaxRSS,AllocCPUS,TotalCPU,State
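The TotalCPU, Elapsed, and AllocCPUS columns from that sacct call are enough to estimate CPU efficiency as TotalCPU / (Elapsed × AllocCPUS). The following is a sketch only; both function names are hypothetical, and it assumes Slurm's usual time formats ([D-]HH:MM:SS or MM:SS.mmm).

```shell
# Hypothetical helper: convert a Slurm time string to whole seconds.
# Accepts D-HH:MM:SS, HH:MM:SS, and MM:SS.mmm (fractions are truncated).
slurm_time_to_seconds() {
    awk -v t="$1" 'BEGIN {
        days = 0
        if (split(t, d, "-") == 2) { days = d[1]; t = d[2] }
        n = split(t, p, ":")
        secs = 0
        for (i = 1; i <= n; i++) secs = secs * 60 + p[i]
        printf "%d\n", days * 86400 + secs
    }'
}

# Hypothetical helper: CPU efficiency as an integer percentage.
# Arguments: TotalCPU, Elapsed, AllocCPUS (as printed by sacct).
cpu_efficiency_pct() {
    local total_cpu elapsed
    total_cpu=$(slurm_time_to_seconds "$1")
    elapsed=$(slurm_time_to_seconds "$2")
    local alloc_cpus=$3
    if (( elapsed == 0 || alloc_cpus == 0 )); then
        echo "elapsed time and CPU count must be nonzero" >&2
        return 1
    fi
    echo $(( 100 * total_cpu / (elapsed * alloc_cpus) ))
}
```

For example, a job with TotalCPU 00:20:00, Elapsed 00:10:00, and 4 allocated CPUs gives cpu_efficiency_pct 00:20:00 00:10:00 4 = 50, which suggests the code kept only about two of its four cores busy and --cpus-per-task could be lowered.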

Checklist

Job starts with a small test.
--time, --mem, and --cpus-per-task are explicit.
Thread variables are set with ${SLURM_CPUS_PER_TASK:-1}.
Arrays use a concurrency throttle like %50.
Output paths include job IDs or task IDs.
Resource usage is checked after completion.
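Part of this checklist can be checked mechanically before submitting. The function below is a hypothetical sketch, not a tool shipped with the skill: it only recognizes the long --flag=value form on #SBATCH lines, so short options such as -t or -c would be reported as missing.

```shell
# Hypothetical pre-submit check: report which of --time, --mem, and
# --cpus-per-task are missing from a script's #SBATCH lines, and whether an
# --array directive lacks a %N throttle. Returns nonzero if anything is missing.
check_sbatch_script() {
    local script=$1 missing=0
    local flag
    for flag in time mem cpus-per-task; do
        if ! grep -Eq "^#SBATCH[[:space:]]+--${flag}=" "$script"; then
            echo "missing: --${flag}"
            missing=1
        fi
    done
    if grep -Eq '^#SBATCH[[:space:]]+--array=' "$script" &&
       ! grep -Eq '^#SBATCH[[:space:]]+--array=[^[:space:]]*%[0-9]+' "$script"; then
        echo "missing: array throttle (%N)"
        missing=1
    fi
    return "$missing"
}

# Example: check_sbatch_script slurm/test.sh && sbatch slurm/test.sh
```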
