From nchc-cluster-skills
Use when needing TWCC/NCHC cluster specs, partition info, QoS limits, pricing, or architecture details. Trigger on: sinfo, scontrol, partition, QoS, MinGPU, GPU type, SU billing, NTD cost, TWCC pricing, cluster identification, ARM vs x86, GB200 compatibility. This is the shared data layer — slurm-submission and slurm-debug both depend on it.
Slash command: /nchc-cluster-skills:cluster-info
Queries, caches, and serves cluster specs for other skills. Partition layouts, QoS limits, and pricing differ between cluster generations — so specs are cached in memory and refreshed when stale.
Do not assume which cluster the user is on. Clusters share similar setups but have different partitions, GPUs, QoS limits, and node names. Guessing wrong leads to failed jobs or wrong advice, especially when users transfer data between clusters.
Valid cluster names: t3-c4, nano4, nano5, twcc, etc. — these are well-known identifiers
that the user or documentation would use.
How to determine the cluster:
Do NOT use these to guess the cluster name:
- hostname: returns login node names like 25a-lgn03, which are NOT cluster names
- scontrol show config | grep ClusterName: returns generic hpc on all clusters

Never proceed to Step 2 without a confirmed cluster name. The cached partition table is per-cluster, so using the wrong cluster's cache is worse than having no cache.
Partition layouts are stable for weeks or months at a time. The partition table is stored in Claude's memory and reused across sessions to avoid unnecessary re-querying.
1. Check memory for a stored cluster info entry
(look for an entry labelled "TWCC/NCHC cluster info").
Found AND matches current cluster AND no staleness signals?
→ Use the cached table. Done.
Not found, OR wrong cluster, OR staleness signal present?
→ Run the live query below, then save result to memory (replace old entry).
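
The decision flow above can be sketched in Python as follows. This is a minimal illustration, not the skill's actual implementation: `CacheEntry`, `choose_action`, and the idea of passing staleness signals as a list of strings are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CacheEntry:
    """Hypothetical shape of the stored "TWCC/NCHC cluster info" entry."""
    cluster: str   # cluster name recorded when the table was saved
    queried: str   # query date, YYYY-MM-DD
    table: str     # the saved partition table text


def choose_action(entry: Optional[CacheEntry],
                  current_cluster: str,
                  staleness_signals: List[str]) -> str:
    """Return "use_cache" or "requery" following the three checks above."""
    if entry is None:                      # not found
        return "requery"
    if entry.cluster != current_cluster:   # wrong cluster
        return "requery"
    if staleness_signals:                  # any staleness signal present
        return "requery"
    return "use_cache"
```

The cache is used only when all three conditions hold; any single failure falls through to the live query, after which the old entry would be replaced.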
Staleness signals — refresh the cache if any apply:
Run all four commands. Each provides fields that the others do not.
# 1. Hardware layout: partition, CPUs, memory, GRES, max wall, node count, features (arch)
sinfo -o "%20P %8c %8m %30G %10l %6D %20f" --noheader
# 2. Partition detail: min/max nodes, default time, QoS name, preempt mode
# NOTE: scontrol does NOT show MinTRES — that's only in sacctmgr (command 3).
# Use the QoS field here to map each partition → its QoS → the QoS limits.
scontrol show partition
# 3. QoS limits: min/max GPUs per job, max jobs per user, max wall time
# **This is the ONLY source for MinTRES (minimum GPU requirement).**
# MinTRES column shows e.g. "gres/gpu=64" meaning min 64 GPUs per job.
# Cross-reference: partition → QoS name (from command 2) → this table.
sacctmgr show qos format=Name,Priority,MinTRESPerJob,MaxTRESPerUser,MaxTRESPerJob,MaxWall,MaxJobsPerUser -P
# 4. Account association limits (user-specific)
sacctmgr show assoc where user=$USER \
format=Account,Cluster,Partition,GrpTRES,MaxTRES,MaxJobs,GrpTRESRunMins -P
How to map MinGPU to partitions: Each partition has a QoS (from scontrol show partition).
Look up that QoS in the sacctmgr show qos output to get MinTRES. If MinTRES is empty, there
is no minimum GPU requirement. Example: partition normal → QoS p_normal → MinTRES gres/gpu=64
→ MinGPU = 64.
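
A rough Python sketch of the partition → QoS → MinTRES lookup described above. It assumes the text of `scontrol show partition` and of the `sacctmgr show qos ... -P` command (pipe-delimited, header on the first line, columns in the order requested by `format=`) has already been captured. The function names are hypothetical.

```python
from typing import Dict, Optional


def partition_qos(scontrol_text: str) -> Dict[str, Optional[str]]:
    """Map partition name -> QoS name from `scontrol show partition` text."""
    mapping: Dict[str, Optional[str]] = {}
    current = None
    for token in scontrol_text.split():
        key, sep, value = token.partition("=")
        if not sep:
            continue
        if key == "PartitionName":
            current = value
            mapping[current] = None
        elif key == "QoS" and current is not None:
            # scontrol prints QoS=N/A when no partition QoS is attached
            mapping[current] = None if value == "N/A" else value
    return mapping


def qos_min_gpu(sacctmgr_text: str,
                name_col: int = 0,
                min_tres_col: int = 2) -> Dict[str, Optional[int]]:
    """Map QoS name -> minimum GPUs per job (None if MinTRES has no GPU entry).

    Column positions are an assumption: they follow the order
    Name,Priority,MinTRESPerJob,... used in the format= argument.
    """
    result: Dict[str, Optional[int]] = {}
    lines = [ln for ln in sacctmgr_text.splitlines() if ln.strip()]
    for line in lines[1:]:  # skip the header line
        cols = line.split("|")
        min_gpu = None
        for item in cols[min_tres_col].split(","):
            key, _, value = item.partition("=")
            if key == "gres/gpu" and value.isdigit():
                min_gpu = int(value)
        result[cols[name_col]] = min_gpu
    return result


def min_gpu_per_partition(scontrol_text: str,
                          sacctmgr_text: str) -> Dict[str, Optional[int]]:
    """Combine both outputs: partition -> MinGPU (None = no minimum)."""
    qos_of = partition_qos(scontrol_text)
    min_of = qos_min_gpu(sacctmgr_text)
    return {p: (min_of.get(q) if q else None) for p, q in qos_of.items()}
```

With the example from the text, a partition `normal` whose QoS is `p_normal` and whose MinTRES is `gres/gpu=64` would map to `64`. A partition with no QoS, or with an empty MinTRES, maps to `None`.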
Key fields to extract per partition:
| Field | Source | Meaning |
|---|---|---|
| Partition name | %P from sinfo | Queue name |
| GPU type + count | %G from sinfo | e.g. gpu:h200:8 |
| MinGPU/job | MinTRESPerJob from sacctmgr QoS (mapped via partition→QoS) | Minimum GPUs required to submit |
| MaxGPU/job | MaxTRESPerJob from sacctmgr QoS | Maximum GPUs per job (empty = no limit) |
| MaxGPU/user | MaxTRESPerUser from sacctmgr QoS | Maximum total GPUs across all running jobs |
| Max wall time | %l from sinfo / MaxTime from scontrol | Hard job time limit |
| MaxJobs/user | MaxJobsPerUser from sacctmgr QoS | Maximum concurrent jobs per user |
| Node count | %D from sinfo | Total nodes in partition |
| Memory/node | %m from sinfo | RAM per node (MB) |
| CPUs/node | %c from sinfo | CPU cores per node |
| Min/MaxNodes | scontrol | Node count constraints per job |
| Architecture | %f (features) from sinfo | aarch64 = ARM; otherwise x86_64 |
| Preempt mode | scontrol | e.g. REQUEUE |
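
The sinfo-derived fields in the table above could be extracted along these lines. This is a sketch that assumes one line of output from the seven-column `sinfo -o "%20P %8c %8m %30G %10l %6D %20f" --noheader` command, with no field containing whitespace. `PartitionRow`, `parse_gpu_gres`, and `parse_sinfo_line` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class PartitionRow:
    partition: str
    is_default: bool            # sinfo marks the default partition with "*"
    cpus: str                   # kept as text; sinfo may print e.g. "64+"
    memory_mb: str              # kept as text for the same reason
    gpu_type: Optional[str]
    gpus_per_node: Optional[int]
    max_wall: str
    nodes: int
    arch: str


def parse_gpu_gres(gres: str) -> Tuple[Optional[str], Optional[int]]:
    """Parse a GRES string such as "gpu:h200:8" into (type, count)."""
    for item in gres.split(","):
        item = item.split("(")[0]          # drop suffixes such as "(S:0-1)"
        parts = item.split(":")
        if parts[0] != "gpu":
            continue
        if len(parts) == 3 and parts[2].isdigit():
            return parts[1], int(parts[2])
        if len(parts) == 2 and parts[1].isdigit():
            return None, int(parts[1])     # untyped GPU, e.g. "gpu:8"
    return None, None                      # no GPU GRES, e.g. "(null)"


def parse_sinfo_line(line: str) -> PartitionRow:
    fields = line.split()
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {line!r}")
    partition, cpus, memory, gres, max_wall, nodes, features = fields
    gpu_type, gpus = parse_gpu_gres(gres)
    return PartitionRow(
        partition=partition.rstrip("*"),
        is_default=partition.endswith("*"),
        cpus=cpus,
        memory_mb=memory,
        gpu_type=gpu_type,
        gpus_per_node=gpus,
        max_wall=max_wall,
        nodes=int(nodes),
        # Rule from the table: aarch64 in the features means ARM, else x86_64
        arch="aarch64" if "aarch64" in features.split(",") else "x86_64",
    )
```

The QoS-derived fields (MinGPU, MaxGPU, MaxJobs) are deliberately absent here, since they come from sacctmgr rather than sinfo.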
You MUST save the query results to memory immediately after querying. This is not optional. The entire caching system depends on it — if you skip this step, every future session will re-query from scratch and other skills (slurm-submission, slurm-debug) will have no data.
Save a single consolidated table that includes both hardware specs AND scheduling rules. Replace any previous entry — one memory file per cluster.
TWCC/NCHC cluster info (queried YYYY-MM-DD, cluster: <name>)

| Partition | GPU | GPUs/node | MinGPU | MaxGPU/job | MaxGPU/user | CPUs/node | Mem/node | Max wall | MaxJobs/user | Nodes | Arch | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| dev | ... | ... | — | — | 16 | ... | ... | 1:00:00 | — | ... | x86 | example row |
| normal | ... | ... | 64 | — | 64 | ... | ... | 12:00:00 | — | ... | x86 | MinGPU from QoS |

— = no limit or not set. ... = fill from query results.

Node ranges:
- <GPU type>: <hostname-prefix>[range]

User accounts: (from sacctmgr assoc query)

Known bad nodes: (maintain as failures are observed)

| Node | Issue | Date Added | Expiry |
|---|---|---|---|

Entries older than 7 days should be re-verified before continuing to exclude.
Include the query date so staleness can be assessed in future sessions. Do not accumulate multiple versions — replace the old entry.
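
As an illustration of how the query date and the 7-day rule for bad-node entries might be checked in a later session, here is a small Python sketch. It assumes the header line follows the template "TWCC/NCHC cluster info (queried YYYY-MM-DD, cluster: <name>)" exactly. The function names are hypothetical, and no age threshold for the table itself is implied.

```python
import re
from datetime import date
from typing import Optional, Tuple

HEADER_RE = re.compile(
    r"TWCC/NCHC cluster info \(queried (\d{4}-\d{2}-\d{2}), cluster: ([^)]+)\)"
)


def parse_header(line: str) -> Optional[Tuple[date, str]]:
    """Return (query_date, cluster_name), or None if the line does not match."""
    m = HEADER_RE.search(line)
    if m is None:
        return None
    return date.fromisoformat(m.group(1)), m.group(2).strip()


def cache_age_days(line: str, today: date) -> Optional[int]:
    """Days elapsed since the table was queried, or None if unparseable."""
    parsed = parse_header(line)
    if parsed is None:
        return None
    return (today - parsed[0]).days


def needs_reverification(date_added: date, today: date,
                         max_age_days: int = 7) -> bool:
    """True when a bad-node entry is older than 7 days (the rule above)."""
    return (today - date_added).days > max_age_days
```

Passing `today` explicitly rather than calling `date.today()` inside the functions keeps them easy to test.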
# Remaining SU budget
sacctmgr show assoc where user=$USER format=Account,GrpTRESRunMins -P
# Current queue
squeue -u $USER -o "%.18i %.9P %.30j %.8u %.8T %.10M %.10l %.6D %R"
Do not quote hardcoded prices. Pricing changes between cluster generations. Fetch from the official NCHC pricing page before quoting any cost figures:
If neither URL is fetchable, ask the user to confirm the current GPU-hour rate for their GPU type and project category before quoting costs.
Billing formula (consistent across generations):
GPU-hours = execution_hours * gpus_requested
cost (NTD) = GPU-hours * rate_per_gpu_hour
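
The billing formula translates directly into code. In this Python sketch the rate is always passed in by the caller, and `PLACEHOLDER_RATE` is a made-up number used only to show the arithmetic, not an NCHC price. The function names are illustrative.

```python
def gpu_hours(execution_hours: float, gpus_requested: int) -> float:
    """GPU-hours = execution_hours * gpus_requested."""
    if execution_hours < 0 or gpus_requested < 0:
        raise ValueError("execution_hours and gpus_requested must be non-negative")
    return execution_hours * gpus_requested


def cost_ntd(execution_hours: float, gpus_requested: int,
             rate_per_gpu_hour: float) -> float:
    """cost (NTD) = GPU-hours * rate_per_gpu_hour."""
    if rate_per_gpu_hour < 0:
        raise ValueError("rate_per_gpu_hour must be non-negative")
    return gpu_hours(execution_hours, gpus_requested) * rate_per_gpu_hour


# Placeholder only: the real rate must come from the official pricing page
# or be confirmed by the user.
PLACEHOLDER_RATE = 10.0

print(gpu_hours(2.5, 8))                   # 20.0
print(cost_ntd(2.5, 8, PLACEHOLDER_RATE))  # 200.0
```

A 2.5-hour job on 8 GPUs consumes 20 GPU-hours; the cost then scales linearly with whatever rate applies to the GPU type and project category.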
npx claudepluginhub nchc-bio/nchc-marketplace --plugin nchc-cluster-skills