Submit, monitor, cancel, array, and chain Slurm jobs on the Yale SOM HPC cluster. TRIGGER when writing sbatch scripts for the Yale SOM HPC cluster, choosing cluster partitions/resources, using job arrays or dependencies on the cluster, or running sacct/squeue/scancel against cluster jobs.
Slash command
/hpc:managing-jobs
Rule: test small, request explicitly, monitor results, then scale.
sbatch job.sh # submit batch job
squeue -u $USER # current jobs
sacct -j 12345 # completed job accounting
scancel 12345 # cancel your job
scontrol show job 12345 # detailed job state
sinfo -s # partition summary
#!/bin/bash
#SBATCH --job-name=analysis
#SBATCH --partition=default_queue
#SBATCH --time=00:30:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --output=logs/%x_%j.out
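# Stop on errors, unset variables, and failed pipeline stages.
set -euo pipefail
# Match math-library thread counts to the allocated CPUs (1 if unset).
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
export MKL_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
export OPENBLAS_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}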
cd /gpfs/project/myproject/code
uv run python src/analysis.py
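Slurm does not create the directory named in --output, so a job whose logs/ directory is missing fails without leaving a log. The following is a minimal sketch of a submit step that creates the directory first. The script path slurm/analysis.sh is a hypothetical placeholder, and the sketch only calls sbatch where the command exists.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical paths for illustration; adjust to your project layout.
log_dir="logs"
job_script="slurm/analysis.sh"

# Slurm will not create the --output directory, so make it before submitting.
mkdir -p "${log_dir}"

# Submit only where sbatch and the script exist, so the sketch is harmless elsewhere.
if command -v sbatch >/dev/null 2>&1 && [ -f "${job_script}" ]; then
    job_id=$(sbatch --parsable "${job_script}")
    echo "submitted job ${job_id}"
else
    echo "sbatch or ${job_script} not found; nothing submitted"
fi
```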
Use interactive jobs only for debugging, never for long unattended work. They hold resources whether or not you are typing: a forgotten srun --pty bash can sit on a GPU all weekend.
srun --partition=cpunormal --cpus-per-task=2 --mem=8G --time=01:00:00 --pty bash
Rules:
- Request a --time measured in hours, not days. Re-request if you need more.
- End the session (exit or Ctrl-D) the moment you are done. Do not minimize the terminal and walk away.
- For anything long or unattended, use sbatch, not srun --pty.
Use arrays for independent tasks. Throttle concurrency with %N.
#!/bin/bash
#SBATCH --job-name=array-example
#SBATCH --partition=default_queue
#SBATCH --array=1-500%50
#SBATCH --time=00:30:00
#SBATCH --cpus-per-task=1
#SBATCH --mem=2G
#SBATCH --output=logs/%x_%A_%a.out
set -euo pipefail
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
export MKL_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
export OPENBLAS_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
uv run python src/run_task.py --task-id "${SLURM_ARRAY_TASK_ID}"
Do not submit 500 tasks all at once unless you mean to occupy the cluster. Use %50 or smaller.
Politeness rule: leave room at the table. If your array can run at 50-way concurrency instead of 500-way, choose 50. Other people are using the same nodes, GPUs, GPFS metadata servers, and queue. Throttling your own work usually also makes debugging easier.
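One common way to give each array task its own inputs is to read line N of a parameter file for task N. The sketch below assumes a hypothetical one-line-per-task file and falls back to task 1 when SLURM_ARRAY_TASK_ID is unset, so the lookup can be tested outside Slurm before submitting the array.

```shell
#!/bin/bash
set -euo pipefail

# Print the arguments for one array task: line N of the file belongs to task N.
args_for_task() {
    local params_file="$1" task_id="$2"
    sed -n "${task_id}p" "${params_file}"
}

# Hypothetical parameter file, created here only so the sketch runs anywhere.
params_file="$(mktemp)"
printf '%s\n' "alpha=0.1" "alpha=0.5" "alpha=0.9" > "${params_file}"

# Outside Slurm the variable is unset, so default to task 1 for local testing.
task_id="${SLURM_ARRAY_TASK_ID:-1}"
task_args="$(args_for_task "${params_file}" "${task_id}")"

if [ -z "${task_args}" ]; then
    echo "no parameters for task ${task_id}" >&2
    exit 1
fi
echo "task ${task_id} runs with ${task_args}"
rm -f "${params_file}"
```

With a file like this, the --array range should match the number of lines, for example --array=1-3%2 for three tasks at two-way concurrency.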
prep=$(sbatch --parsable slurm/01_prepare.sh)
est=$(sbatch --parsable --dependency=afterok:${prep} slurm/02_estimate.sh)
sbatch --dependency=afterok:${est} slurm/03_tables.sh
Use afterany for cleanup or restart logic that should run even after failure.
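A job can depend on several earlier jobs at once, because Slurm accepts multiple job IDs joined by colons after the dependency type. The helper below is a small sketch for building that flag. The function name dependency_flag and the job IDs are made up for illustration; in practice the IDs would come from sbatch --parsable.

```shell
#!/bin/bash
set -euo pipefail

# Build a --dependency flag from a dependency type and one or more job IDs.
# Slurm accepts several IDs joined by colons, e.g. afterok:101:102.
dependency_flag() {
    local dep_type="$1"
    shift
    local IFS=":"
    printf -- '--dependency=%s:%s\n' "${dep_type}" "$*"
}

# Illustrative job IDs only.
dependency_flag afterok 101 102   # prints --dependency=afterok:101:102
dependency_flag afterany 103      # prints --dependency=afterany:103

# Hypothetical use with a cleanup script that should run even after failure:
#   sbatch "$(dependency_flag afterany "${est}")" slurm/99_cleanup.sh
```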
Shorter jobs often schedule faster because Slurm can backfill them into idle slots between bigger jobs. Multi-day jobs queue behind everyone. If work is resumable, prefer 2–4 hour chunks; with skip-if-exists outputs, a killed job picks up where it left off on resubmit.
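The skip-if-exists pattern can be sketched as a loop that checks for each output before doing the work. The chunk names, output directory, and process_chunk function are hypothetical stand-ins. Writing to a temporary name and renaming at the end is one way to make sure a job killed mid-write does not leave a partial file that looks finished.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical work list and output directory for illustration.
out_dir="$(mktemp -d)"
chunks=(chunk01 chunk02 chunk03)

# Stand-in for the real computation. It writes to a temp name and renames,
# so an interrupted write never leaves a finished-looking output behind.
process_chunk() {
    local name="$1" out="$2"
    echo "result for ${name}" > "${out}.tmp"
    mv "${out}.tmp" "${out}"
}

# Process every chunk whose output is missing; print how many were processed.
run_all() {
    local done_count=0 name out
    for name in "${chunks[@]}"; do
        out="${out_dir}/${name}.txt"
        if [ -s "${out}" ]; then
            continue   # finished in an earlier run; skip
        fi
        process_chunk "${name}" "${out}"
        done_count=$((done_count + 1))
    done
    echo "${done_count}"
}

first_run="$(run_all)"            # processes all three chunks
rm -f "${out_dir}/chunk02.txt"    # simulate an output lost to a killed job
second_run="$(run_all)"           # redoes only the missing chunk
echo "first run: ${first_run}, second run: ${second_run}"
```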
Do not pad requests "just in case." Over-requesting blocks scheduling for everyone, and the cluster is moving toward enforced per-user caps. The right-sizing loop:
1. Run a small test job, then check seff JOBID after it finishes.
2. Set --mem to ~1.5–2× the test's MaxRSS, not 10×.
3. Set --cpus-per-task to what your code actually parallelizes over (use SLURM_CPUS_PER_TASK to set BLAS threads, multiprocessing workers, setDTthreads, and set processors).
4. Set --time from a sample-data extrapolation, not from "what if it takes a week."
See self-diagnosing resource use for the post-job checks that drive this loop.
sbatch slurm/test.sh
squeue -u $USER
sacct -j JOBID --format=JobID,Elapsed,MaxRSS,AllocCPUS,TotalCPU,State
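The memory step of the right-sizing loop is simple arithmetic on the MaxRSS column. The sketch below converts a MaxRSS value with a K, M, or G suffix into a padded --mem request in megabytes. The function name suggest_mem_mb is hypothetical, and the default factor of 1.5 is the low end of the 1.5–2× range given above.

```shell
#!/bin/bash
set -euo pipefail

# Convert a MaxRSS value (e.g. 2097152K, 900M, 3G) into a suggested --mem
# request in megabytes, padded by a safety factor (default 1.5) and rounded up.
suggest_mem_mb() {
    local max_rss="$1" factor="${2:-1.5}"
    awk -v rss="${max_rss}" -v f="${factor}" 'BEGIN {
        unit = substr(rss, length(rss), 1)
        value = rss + 0                       # numeric prefix of the string
        if (unit == "K") mb = value / 1024
        else if (unit == "M") mb = value
        else if (unit == "G") mb = value * 1024
        else { print "unrecognized MaxRSS unit: " rss > "/dev/stderr"; exit 1 }
        padded = mb * f
        rounded = int(padded)
        if (rounded < padded) rounded++       # round up to a whole megabyte
        printf "%dM\n", rounded
    }'
}

suggest_mem_mb 2097152K    # 2 GiB peak at 1.5x: prints 3072M
suggest_mem_mb 3G 2        # 3 GiB peak at 2x: prints 6144M
```

The result can be passed straight to the next submission, for example sbatch --mem=3072M slurm/test.sh.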
Before scaling up, confirm:
- --time, --mem, and --cpus-per-task are explicit.
- Thread counts are set from ${SLURM_CPUS_PER_TASK:-1}.
- Arrays are throttled, for example with %50.
References:
- sbatch: #SBATCH directives, output filename patterns, signal handling.
- Job arrays: --array syntax, throttling (%N), SLURM_ARRAY_* env vars.