By NCHC-bio
Claude Code skills for TWCC / NCHC HPC cluster usage: SLURM job submission, debugging, GPU allocation, pricing, and post-job review
arm64-pipeline: Use when setting up, porting, or debugging a job on any NCHC/TWCC partition whose compute nodes are aarch64 / ARM64 while the login node is x86_64. The current example is GB200 (gb200-rack1, gb200-rack2, gb200-dev, gb200-full; Grace CPU), and this skill will apply to any future ARM-based partition the cluster adds. Scope is strictly aarch64/arch-mismatch issues. Trigger on: user wants to run on an ARM / aarch64 / Grace / Grace Hopper node, "Exec format error", "cannot execute binary file", torch CUDA False on ARM, torchcodec on ARM, libavutil missing, FFmpeg on ARM, uv on ARM node, conda/virtualenv leakage to ARM node, "my pipeline works on login but fails on the ARM node". Do NOT trigger for: generic SLURM submission (use slurm-submission), non-interactive auth / HF token / wandb in batch (use slurm-submission), partition specs, preemption, QoS (use cluster-info), tmpfs / cgroup OOM / SIGKILL / hangs (use slurm-debug).
cluster-info: Use when needing TWCC/NCHC cluster specs, partition info, QoS limits, pricing, or architecture details. Trigger on: sinfo, scontrol, partition, QoS, MinGPU, GPU type, SU billing, NTD cost, TWCC pricing, cluster identification, ARM vs x86, GB200 compatibility. This is the shared data layer; slurm-submission and slurm-debug both depend on it.
slurm-debug: Use when debugging SLURM job failures, hangs, crashes, or unexpected behavior on TWCC/NCHC HPC clusters. Trigger on: job hang, timeout, CUDA error, OOM, segfault, NCCL timeout, srun error, exit code, node drain, GPU utilization 0%, deadlock, job cancelled, ImportError in container, slow training, or any "my job isn't working" question. Do NOT trigger for: job submission (use slurm-submission), or partition/pricing queries (use cluster-info).
slurm-submission: Use when submitting SLURM jobs, writing sbatch scripts, choosing GPU count, or reviewing completed jobs on TWCC/NCHC clusters. Trigger on: sbatch, job submission, GPU allocation, DDP, multi-GPU, wall time, job template, seff, post-job review, HuggingFace token in batch job, wandb login in batch, HF 401 in job, non-interactive auth. Do NOT trigger for: partition queries or pricing (use cluster-info), or job failures/hangs (use slurm-debug). If the target partition has aarch64/ARM compute nodes (currently any gb200-*, and any future ARM partition), ALSO load arm64-pipeline for aarch64-specific setup.
verify-before-claiming: Use when the user asks to prove a bug claim by running code: "can you verify this?", "is that actually a bug?", "prove it", "run it and check". Agents often hallucinate bugs; this skill forces the claim through a falsification test: propose a patch, run HEAD and patch against the real entry point, compare outputs. No behavior change = claim invalidated. Do NOT trigger for general discussion or analysis where no bug claim is being tested.
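The falsification test described for bug claims (run HEAD and the patch against the same entry point, then compare outputs) can be sketched in a few lines. This is only an illustration of the idea, not the skill's actual implementation: the function names `run_variant` and `claim_survives` are hypothetical, and the two inline commands stand in for a real HEAD checkout and a patched checkout.

```python
import subprocess
import sys


def run_variant(cmd: list[str]) -> tuple[int, str]:
    """Run one variant of the entry point; return (exit code, stdout)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout


def claim_survives(head_cmd: list[str], patch_cmd: list[str]) -> bool:
    """A bug claim survives only if the patch changes observable behavior."""
    return run_variant(head_cmd) != run_variant(patch_cmd)


if __name__ == "__main__":
    # Hypothetical stand-ins: both commands compute the same value, so the
    # "patch" changes nothing and the claim is rejected.
    head = [sys.executable, "-c", "print(sum(range(10)))"]
    patch = [sys.executable, "-c", "print(sum(range(0, 10)))"]
    if claim_survives(head, patch):
        print("claim stands")
    else:
        print("claim invalidated")  # this branch runs for the pair above
```

Comparing exit code and stdout together is a deliberate simplification; a real comparison would likely also need to ignore timestamps and other nondeterministic output.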
Claude Code skills plugin for working with TWCC / NCHC (National Center for High-performance Computing) HPC clusters.
| Skill | Triggers on |
|---|---|
| cluster-info | sinfo, scontrol, partition specs, QoS limits, MinGPU, SU billing, pricing, ARM vs x86, cluster identification |
| slurm-submission | sbatch, job submission, GPU allocation, DDP, multi-GPU, wall time, job template, post-job review |
| slurm-debug | Job hang, timeout, CUDA error, OOM, segfault, NCCL timeout, exit code, node drain, GPU utilization 0%, deadlock, job cancelled, slow training |
| verify-before-claiming | User explicitly asks to verify, reproduce, benchmark, or compare behavior by running code ("can you verify this?", "run it and check", "prove it") |
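The "ARM vs x86" trigger comes down to one concrete check: whether a binary built on the x86_64 login node targets the architecture of the compute node that will run it. A minimal sketch, assuming Linux ELF binaries; the helper names `elf_machine` and `runs_on_this_node` are hypothetical and are not part of the plugin. The `e_machine` values are the ones defined by the ELF specification.

```python
import platform
import struct

# e_machine values from the ELF specification.
ELF_MACHINES = {0x3E: "x86_64", 0xB7: "aarch64"}


def elf_machine(header: bytes) -> str:
    """Return the target architecture encoded in the first 20 ELF bytes."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # EI_DATA (byte 5): 1 = little-endian, 2 = big-endian.
    endian = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at byte offset 18.
    (machine,) = struct.unpack_from(endian + "H", header, 18)
    return ELF_MACHINES.get(machine, f"unknown (0x{machine:X})")


def runs_on_this_node(path: str) -> bool:
    """Compare a binary's target with the current node (Linux naming)."""
    with open(path, "rb") as f:
        return elf_machine(f.read(20)) == platform.machine()
```

Running `runs_on_this_node` inside the job, on the compute node, is the point of the check: a binary that passes on the login node can still fail with "Exec format error" on an aarch64 node.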
Open Claude Code by running `claude` in your terminal, then run the following commands:
/plugin marketplace add NCHC-bio/nchc-marketplace
/plugin install nchc-cluster-skills@nchc-marketplace
To pull the latest version after a release:
/plugin marketplace update nchc-marketplace
/plugin update nchc-cluster-skills@nchc-marketplace
For local development, load the plugin from a checkout and reload it after changes:
claude --plugin-dir /path/to/nchc-cluster-skills
/reload-plugins
Shared data layer for cluster specs, used by both slurm-submission and slurm-debug:
- Runs sinfo/scontrol/sacctmgr once, stores the results in memory, and refreshes them when stale

Guides Claude through job submission (depends on cluster-info for cached specs):
- seff, utilization log analysis, actionable next-step table

Guides Claude through a structured debug flow when SLURM jobs fail or hang:
- Reads sacct State + ExitCode, then follows the matching branch
- System checks (/tmp, /dev/shm, dmesg) before code investigation
- --exclude lists
- Maps squeue reasons to QoS limits and partition constraints

Opt-in skill for when the user explicitly asks Claude to prove a behavioral claim by running code:
- switch_*.py + run_*.sh + parse_*.py + per-variant logs under .verify_<slug>/
- slurm-submission

npx claudepluginhub nchc-bio/nchc-marketplace --plugin nchc-cluster-skills
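The debug flow above starts from sacct's State and ExitCode and then picks a branch. The sketch below shows what that first step could look like, assuming input lines in the form produced by `sacct -j <jobid> -n -P -o State,ExitCode` (for example `FAILED|1:0`, where ExitCode is `code:signal`). The function name `triage` and the branch labels are hypothetical, and treating signal 9 as a memory problem is a heuristic assumption, not the skill's documented rule.

```python
def triage(sacct_line: str) -> str:
    """Map one `State|ExitCode` line to a coarse debugging branch."""
    state_field, exit_field = sacct_line.strip().split("|")
    # "CANCELLED by 1234" -> "CANCELLED"
    state = state_field.split()[0]
    code, signal = (int(part) for part in exit_field.split(":"))

    if state == "COMPLETED":
        return "ok"
    # Assumption: SIGKILL (9) is treated as a likely out-of-memory kill.
    if state == "OUT_OF_MEMORY" or signal == 9:
        return "memory"
    if state == "TIMEOUT":
        return "walltime"
    if state == "NODE_FAIL":
        return "node"
    if state == "CANCELLED":
        return "cancelled"
    # Anything else (e.g. FAILED with a nonzero code) points at the job itself.
    return "application"
```

For instance, `triage("CANCELLED by 1234|0:0")` returns `"cancelled"`, while `triage("FAILED|1:0")` returns `"application"`.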