From nchc-cluster-skills
Use when debugging SLURM job failures, hangs, crashes, or unexpected behavior on TWCC/NCHC HPC clusters. Trigger on: job hang, timeout, CUDA error, OOM, segfault, NCCL timeout, srun error, exit code, node drain, GPU utilization 0%, deadlock, job cancelled, ImportError in container, slow training, or any "my job isn't working" question. Do NOT trigger for: job submission (use slurm-submission), or partition/pricing queries (use cluster-info).
Slash command: `/nchc-cluster-skills:slurm-debug`
**Core principle:** When a job fails or hangs, **do not guess**. Route by `sacct` State + ExitCode first, then follow the matching branch.
REQUIRED: Use cluster-info skill for partition specs, QoS limits, and known bad nodes data.
| Symptom | Likely cause | Section |
|---|---|---|
| Traceback in `.err` log | Code / container / path error | Application Error |
| `signal 9` (kill -9) | CPU/RAM OOM | OOM |
| `CUDA out of memory` in `.err` | GPU OOM | OOM |
| No output, no error, no progress | Hang / deadlock | Hang or Slow |
| `NODE_FAIL` state | Hardware / node issue | Bad Node |
| Job stays `PENDING` forever | QoS / resource mismatch | PENDING |
Get the job's State and ExitCode:
`sacct -j <JOBID> --format=JobID,State,ExitCode,Elapsed,MaxRSS,NodeList -P`
| State | ExitCode | Go to |
|---|---|---|
| FAILED | 1:0 or 2:0 | → Application Error |
| FAILED | 137:0 (signal 9) | → OOM |
| TIMEOUT / CANCELLED | (signal 15) | → Hang or Slow |
| OUT_OF_MEMORY | any | → OOM (SLURM memory limit) |
| NODE_FAIL | any | → Bad Node |
| PENDING (never starts) | any | → PENDING |
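The routing table can be sketched as a small lookup over `sacct -P` output. This is a minimal illustration, not part of the skill itself: the function names `parse_sacct` and `route` are hypothetical, and any State/ExitCode pair the table does not list is returned as `Unknown` instead of being guessed.

```python
def parse_sacct(text: str) -> list[dict[str, str]]:
    """Parse pipe-delimited `sacct -P` output (header row first) into dicts."""
    lines = [line for line in text.strip().splitlines() if line]
    if not lines:
        return []
    header = lines[0].split("|")
    return [dict(zip(header, line.split("|"))) for line in lines[1:]]


def route(state: str, exit_code: str) -> str:
    """Map a State/ExitCode pair to the section named in the routing table."""
    # sacct may print "CANCELLED by <uid>" or a truncated "CANCELLED+";
    # keep only the bare state word.
    words = state.split()
    state = words[0].rstrip("+").upper() if words else ""
    code = exit_code.split(":")[0]  # ExitCode is "<exit code>:<signal>"

    if state == "FAILED":
        if code == "137":
            return "OOM"
        if code in ("1", "2"):
            return "Application Error"
        return "Unknown"  # not covered by the table; read the .err log
    if state in ("TIMEOUT", "CANCELLED"):
        return "Hang or Slow"
    if state == "OUT_OF_MEMORY":
        return "OOM"
    if state == "NODE_FAIL":
        return "Bad Node"
    if state == "PENDING":
        return "PENDING"
    return "Unknown"
```

Feed it the text captured from the `sacct` command above, then call `route(row["State"], row["ExitCode"])` on each parsed row.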
**Application Error**

Read the `.err` log for the last traceback. Check in order:

1. Review `git log` since the last working run.
2. Confirm the container `--bind` / `--mount` options cover all needed paths.

**OOM**

First distinguish GPU OOM (CUDA error in `.err`) from CPU/RAM OOM (signal 9, no CUDA error).
GPU OOM fixes: reduce batch size, enable gradient checkpointing, use mixed precision, offload optimizer states.
CPU/RAM OOM — check these in order:
1. `HF_HOME`, pre-download, `TRANSFORMERS_OFFLINE=1`
2. `num_workers`: each worker forks and may duplicate dataset memory

**Hang or Slow**

This is the hardest category. Follow in order:
1. Check `voluntary_ctxt_switches` in `/proc/<PID>/status` twice, 30 seconds apart. A count that does not move between the two reads suggests the process is blocked rather than just slow.
2. `/tmp` usage on the node: the NCHC cluster has no SLURM Prolog/Epilog cleanup, so `/tmp` garbage accumulates from all users indefinitely. Usage >50% causes silent deadlocks. Workaround: add `/tmp` cleanup of your own cache patterns at job start. If another user's garbage fills it, exclude the node.
3. `/dev/shm`: DataLoader shared memory can fill up.
4. `dmesg`: check for hardware errors (ECC, GPU Xid).
5. `fork()` after JAX/CUDA init: use the `spawn` multiprocessing start method.

Rule of thumb: Same code works on some nodes but hangs on others → it's the node, not the code.
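The first two hang checks can be sketched in Python as follows. This is an illustrative sketch with hypothetical helper names (`voluntary_switches`, `looks_stuck`, `tmp_usage_fraction`). The 50% threshold is the one quoted above, and treating an unchanged counter as "stuck" is a heuristic, not proof.

```python
import re
import shutil
import time


def voluntary_switches(status_text: str) -> int:
    """Extract voluntary_ctxt_switches from the text of /proc/<PID>/status."""
    match = re.search(
        r"^voluntary_ctxt_switches:\s+(\d+)$", status_text, re.MULTILINE
    )
    if match is None:
        raise ValueError("voluntary_ctxt_switches not found")
    return int(match.group(1))


def looks_stuck(first_status: str, second_status: str) -> bool:
    """Heuristic: the counter did not change between two reads."""
    return voluntary_switches(first_status) == voluntary_switches(second_status)


def sample_status(pid: int, interval: float = 30.0) -> tuple[str, str]:
    """Read /proc/<pid>/status twice, `interval` seconds apart (Linux only)."""
    path = f"/proc/{pid}/status"
    with open(path) as handle:
        first = handle.read()
    time.sleep(interval)
    with open(path) as handle:
        second = handle.read()
    return first, second


def tmp_usage_fraction(path: str = "/tmp") -> float:
    """Fraction of the filesystem holding `path` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total if usage.total else 0.0


def tmp_is_risky(path: str = "/tmp", threshold: float = 0.5) -> bool:
    """True when usage exceeds the >50% level mentioned in the checklist."""
    return tmp_usage_fraction(path) > threshold
```

Run it on the compute node that hosts the job, for example `looks_stuck(*sample_status(pid))` with the PID of the training process. Calling `tmp_is_risky()` at job start gives an early warning before the job reaches the point where it would hang.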
**Bad Node**

1. Find the node in the `sacct` NodeList column.
2. `drained`/`down` in `sinfo` = known issue, already excluded by the scheduler.
3. `mixed`/`idle` doesn't mean healthy: `/tmp` garbage and GPU errors can exist on SLURM-healthy nodes.
4. Add the node to `--exclude` and update the known bad nodes record with date and symptom.

**PENDING**

Check the `(Reason)` from `squeue -j <JOBID>`, then cross-reference against QoS limits
and partition constraints from the cluster-info skill:
| Reason | Check |
|---|---|
| `QOSMinGRESNotSatisfied` | Requested GPUs below QoS MinGPU/Job; check cached QoS limits |
| `QOSMaxGRESPerUser` | Total GPUs across running jobs exceeds QoS MaxGPU/User limit |
| `QOSMaxJobsPerUserLimit` | Too many concurrent jobs |
| `Resources` | Request doesn't fit partition (GPUs, nodes, wall time) |
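For the `Resources` row, one quick check is whether the requested wall time fits the partition limit. The sketch below converts SLURM time strings to seconds so the two can be compared. The helper names are hypothetical and the limits in the usage note are placeholders. Real MaxWall values come from the cluster-info skill.

```python
def slurm_time_to_seconds(spec: str) -> int:
    """Convert a SLURM time string to seconds.

    Accepted forms: "minutes", "minutes:seconds", "hours:minutes:seconds",
    "days-hours", "days-hours:minutes", "days-hours:minutes:seconds".
    Special values such as UNLIMITED are not handled in this sketch.
    """
    spec = spec.strip()
    days = 0
    if "-" in spec:
        day_part, rest = spec.split("-", 1)
        days = int(day_part)
        parts = [int(p) for p in rest.split(":")]
        if len(parts) > 3:
            raise ValueError(f"unrecognised time spec: {spec!r}")
        # With a day prefix, fields are hours[:minutes[:seconds]].
        parts += [0] * (3 - len(parts))
        hours, minutes, seconds = parts
    else:
        parts = [int(p) for p in spec.split(":")]
        if len(parts) == 1:  # minutes
            hours, minutes, seconds = 0, parts[0], 0
        elif len(parts) == 2:  # minutes:seconds
            hours, minutes, seconds = 0, parts[0], parts[1]
        elif len(parts) == 3:  # hours:minutes:seconds
            hours, minutes, seconds = parts
        else:
            raise ValueError(f"unrecognised time spec: {spec!r}")
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def fits_max_wall(requested: str, max_wall: str) -> bool:
    """True when the requested --time does not exceed the partition MaxWall."""
    return slurm_time_to_seconds(requested) <= slurm_time_to_seconds(max_wall)
```

With a placeholder limit of `2-00:00:00`, `fits_max_wall("4-00:00:00", "2-00:00:00")` is false, which is the situation where a job can sit in PENDING without ever starting.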
Common mistakes:
- `--time` exceeds partition MaxWall
- `--nodes` exceeds partition MaxNodes

Before resubmitting, check:

- `/tmp` cleanup added to job script?
- `--exclude` updated for bad nodes?
- `--time` / `--mem` adjusted if limits were hit?
npx claudepluginhub nchc-bio/nchc-marketplace --plugin nchc-cluster-skills