From sagemaker-ai
Debug-only skill that identifies and classifies Slurm scheduler and node-daemon issues on Amazon SageMaker HyperPod clusters.
Slash command
/sagemaker-ai:hyperpod-slurm-debugger
Diagnostic-only. Identify and classify Slurm scheduler and node-daemon issues on HyperPod Slurm clusters. Do not run, recommend, or print any state-mutating command. For remediation, link to the official AWS or Slurm documentation.
Invoke when the user reports any of the symptoms in the decision table.
- Orchestrator.Eks: invoke hyperpod-node-debugger or hyperpod-nccl.
- hyperpod-node-debugger.
- hyperpod-nccl.
- hyperpod-ssm.

Canonical recovery URLs: references/slurm-details.md → Authoritative recovery documentation.
- sagemaker:DescribeCluster, sagemaker:ListClusterNodes
- ssm:StartSession on the HyperPod-created SSM document
- jq ≥ 1.6.
- unbuffer (from the expect package). Required: without it, aws ssm start-session intermittently returns empty stdout with Cannot perform start session: EOF, and every check silently misreports. Install the expect package on Amazon Linux / RHEL / Debian / Ubuntu / macOS. The script exits at the prerequisite check if it is missing.

Ask the user for:
aws sagemaker describe-cluster --cluster-name <NAME/ARN> --region <REGION> \
--query 'Orchestrator' --output json
If Orchestrator.Eks is present, stop. Route per When NOT to invoke.
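The routing decision above can be sketched as a small read-only filter over the JSON that the `describe-cluster` query prints. The function name `orchestrator_kind` is hypothetical, and the plain-text match on the `Eks` key is an assumption made to keep the sketch free of dependencies; `scripts/slurm-diagnose.sh` may implement the check differently.

```shell
# Hypothetical helper: reads the Orchestrator JSON on stdin and prints
# "eks" when an "Eks" key is present, "not-eks" otherwise (for example
# when the query returns null).
orchestrator_kind() {
  if grep -q '"Eks"'; then
    echo "eks"
  else
    echo "not-eks"
  fi
}

# Usage (read-only):
#   aws sagemaker describe-cluster --cluster-name <NAME/ARN> --region <REGION> \
#     --query 'Orchestrator' --output json | orchestrator_kind
```

If the result is `eks`, stop and route to the EKS-side skills instead of running the Slurm diagnostics.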
bash scripts/slurm-diagnose.sh --cluster <NAME> --region <REGION>
# Scope to a node:
bash scripts/slurm-diagnose.sh --cluster <NAME> --region <REGION> --node <SLURM_NODE>
Relay the script output to the user verbatim.
For each finding, look up the section in the decision table and link the user to the corresponding AWS / Slurm doc. Do not type out remediation commands.
| Symptom (sinfo -o "%N %T %30E" or script finding) | Section |
|---|---|
| Node state = down or down*, reason other than below | A: Node Down |
| Node state = down*, Reason = Node unexpectedly rebooted | B: Unexpected Reboot |
| Jobs PENDING with REASON=Resources while nodes are idle | C: Controller State |
| Jobs stuck COMPLETING after node replacement | C: Controller State |
| scontrol ping returns DOWN for the controller | C: Controller State |
| GRES (GPU) counts incorrect or not released | C: Controller State |
| state=fail issued but no recovery occurred | D: Action Reason Mismatch |
| Accounting errors or RPC errors mentioning dbd | C: Controller State (slurmdbd) |
| slurm.conf edited; new partitions or nodes not visible | C: Controller State (config) |
| Job exited on a hardware failure but did not restart | E: Auto-resume |
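As an illustration of the first two rows, the following sketch maps `sinfo -o "%N %T %30E"` output to sections A and B. The function name `classify_sinfo_lines` is hypothetical and this is not the logic of `scripts/slurm-diagnose.sh`. It treats any state beginning with `down` whose reason contains "unexpectedly rebooted" as section B, which is slightly looser than the table.

```shell
# Hypothetical classifier: reads `sinfo -o "%N %T %30E"` output on stdin and
# prints "<node> A" or "<node> B" for down nodes. Other states are ignored.
classify_sinfo_lines() {
  awk '
    NR == 1 && $1 == "NODELIST" { next }   # skip the header row if present
    {
      node = $1; state = $2
      # Rejoin the remaining fields: the reason may contain spaces.
      reason = ""
      for (i = 3; i <= NF; i++) reason = reason (i > 3 ? " " : "") $i
      if (state ~ /^down/ && reason ~ /unexpectedly rebooted/) print node, "B"
      else if (state ~ /^down/) print node, "A"
    }'
}

# Usage (read-only, on a node where sinfo is available):
#   sinfo -o "%N %T %30E" | classify_sinfo_lines
```

Only sections A and B are covered here; the remaining rows depend on job, controller, or accounting state that `sinfo` alone does not show.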
| Behavior | Default | Override |
|---|---|---|
| Mode | read-only — always; no remediation flag exists | n/a |
| Region | $AWS_DEFAULT_REGION, falling back to us-east-1 | --region <R> |
| Scope | all nodes in down / drain / fail / "unexpectedly rebooted" | --node <SLURM_NODE_NAME> |
| Output | colorized terminal | --no-color |
| SSM target format | sagemaker-cluster:<clusterId>_<instanceGroupName>-<instanceId> (derived) | n/a |
| Controller discovery | --controller-group (if set) → SlurmConfig.NodeType=Controller → provisioning_parameters.json | --controller-group <N> |
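Two of the defaults above can be sketched in a few lines. The function names `resolve_region` and `ssm_target` are hypothetical, and the sketch assumes that the cluster ID is the last path segment of the cluster ARN; the script's own derivation is not shown in this document.

```shell
# Region: explicit override first, then $AWS_DEFAULT_REGION, then us-east-1.
resolve_region() {
  local override="${1:-}"
  if [ -n "$override" ]; then
    printf '%s\n' "$override"
  else
    printf '%s\n' "${AWS_DEFAULT_REGION:-us-east-1}"
  fi
}

# SSM target: sagemaker-cluster:<clusterId>_<instanceGroupName>-<instanceId>.
# Assumption: the cluster ID is the last path segment of the cluster ARN.
ssm_target() {
  local cluster_arn="$1" group="$2" instance_id="$3"
  local cluster_id="${cluster_arn##*/}"
  printf 'sagemaker-cluster:%s_%s-%s\n' "$cluster_id" "$group" "$instance_id"
}

# Usage:
#   ssm_target "<CLUSTER_ARN>" "<INSTANCE_GROUP_NAME>" "<INSTANCE_ID>"
```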
| Failure | Skill behavior | Required user action |
|---|---|---|
| describe-cluster fails | Print AWS error; exit 1 | Fix credentials/region; verify cluster name |
| Cluster has Orchestrator.Eks | Exit 1 with pointer to EKS-side skills | Use hyperpod-node-debugger or hyperpod-nccl |
| session-manager-plugin missing / SSM unreachable | sinfo returns empty; exit 1 | Install plugin; verify node InService |
| Disk ≥ 95 % full on a down node | Report finding disk-full-<node> | Refer to AWS troubleshooting docs |
| Missing jq or aws | Exit 1 at prerequisite check | Install per Prerequisites |
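A prerequisite check of the kind the last row describes could look like the sketch below. The function name `missing_prereqs` is hypothetical and the tool list in the usage comment is taken from this document. The sketch does not verify the jq version.

```shell
# Prints each command from the argument list that is not found on PATH.
missing_prereqs() {
  local cmd
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

# Usage:
#   missing=$(missing_prereqs aws jq unbuffer session-manager-plugin)
#   [ -z "$missing" ] || { echo "missing prerequisites: $missing" >&2; exit 1; }
```

Failing early matters most for `unbuffer`, because without it the checks can silently misreport instead of erroring out.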
Node is down because slurmd stopped responding. Causes: slurmd crash, disk full,
OOM, network partition, hardware fault.
Script checks: systemctl is-active slurmd, srun -w <NODE> hostname (RPC layer), disk,
memory.
If node returns to down after a manual resume → escalate to hyperpod-node-debugger.
Context: references/slurm-details.md § A.
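The disk check can be illustrated with a filter over `df -P` output, using the 95 % threshold from the failure table. The function name `disk_full_mounts` is hypothetical, and the sketch assumes mount points without spaces; how the script itself measures disk usage is not shown here.

```shell
# Reads `df -P` output on stdin and prints "<mount> <use>%" for every
# filesystem at or above the threshold (default 95).
disk_full_mounts() {
  local threshold="${1:-95}"
  awk -v t="$threshold" '
    NR > 1 {                      # skip the df header row
      use = $5
      sub(/%/, "", use)           # "96%" -> "96"
      if (use + 0 >= t) print $6, use "%"
    }'
}

# Usage (read-only, on the affected node):
#   df -P | disk_full_mounts 95
```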
Node is down* with Reason "Node unexpectedly rebooted" because slurmd
re-registered after an out-of-band reboot. Upstream Slurm behavior, not HyperPod.
Node is typically healthy.
Links:
- state=resume semantics)

If node reboots again within minutes → escalate to hyperpod-node-debugger.
Context: references/slurm-details.md § B.
slurmctld in-memory state can desync from the on-disk state. A controller restart reloads from StateSaveLocation and clears bad caches. User decides and executes.
Restart may help:
| Symptom | Why |
|---|---|
| PENDING with REASON=Resources, idle nodes | Re-evaluates the queue |
| Jobs stuck COMPLETING after node replacement | Controller held a reference to the old node |
| GRES (GPU, EFA) not released after a job ends | Resource accounting de-synced |
| Nodes stuck Unknown after reboot, slurmd is up | Re-registration was not processed |
| scontrol ping times out | Controller event loop is hung |
| Lost connection to slurmdbd / RPC errors | DBD connection wedged |
Do NOT restart when:
- Action:Replace) in progress on any node: concurrent changes fail the replacement.
- slurmd on that node.
- sinfo and squeue are responsive: problem is elsewhere.
- journalctl -u slurmctld not reviewed yet: panic / OOM will reproduce.
- slurm.conf was just edited: try scontrol reconfigure first.

sacct fails, accounting fields show Unknown, controller log spams Unable to contact slurmdbd. Restore slurmdbd before considering controller restart.
https://slurm.schedmd.com/accounting.html · details.

slurm.conf / topology.conf mtime > slurmctld start. scontrol reconfigure first; restart is fallback.
https://slurm.schedmd.com/scontrol.html · details.

Restart procedure / what's preserved:
Context: references/slurm-details.md § C.
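The first symptom in this section, jobs PENDING with REASON=Resources while nodes are idle, can be detected from saved read-only output. In this sketch the function name `pending_while_idle` and the two-file interface are hypothetical. It does not check that the idle nodes belong to the pending job's partition, so a match is a hint to look closer, not proof of a desync.

```shell
# Returns 0 when at least one job is PENDING with reason Resources while at
# least one node is idle; returns 1 otherwise.
#   $1: file holding `squeue -h -o "%T %r"` output
#   $2: file holding `sinfo -h -o "%T"` output
pending_while_idle() {
  local squeue_out="$1" sinfo_out="$2"
  awk '$1 == "PENDING" && $2 == "Resources" { found = 1 } END { exit !found }' "$squeue_out" &&
    awk '$1 == "idle" { found = 1 } END { exit !found }' "$sinfo_out"
}

# Usage (read-only):
#   squeue -h -o "%T %r" > /tmp/squeue.out
#   sinfo -h -o "%T" > /tmp/sinfo.out
#   pending_while_idle /tmp/squeue.out /tmp/sinfo.out && echo "see section C"
```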
scontrol update state=fail reason=... was issued with a reason that does not match
Action:Reboot or Action:Replace exactly. HyperPod silently ignores anything else.
Script detects near-misses on nodes in fail state.
Required strings (case-sensitive, no whitespace, no punctuation):
- Action:Reboot
- Action:Replace

Context: references/slurm-details.md § Action reason-string validation.
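Near-miss detection of the kind described in this section can be sketched as a pure string check. The function name `check_action_reason` and its three labels are hypothetical, and the normalization rule (drop whitespace and punctuation, ignore case) is an assumption about what counts as a near-miss, not the script's actual rule. The function only inspects a reason string and changes no node state.

```shell
# Classifies a node reason string:
#   "exact"     : matches Action:Reboot or Action:Replace exactly
#   "near-miss" : differs only in case, whitespace, or punctuation
#   "unrelated" : anything else
check_action_reason() {
  local reason="$1" normalized
  case "$reason" in
    "Action:Reboot" | "Action:Replace")
      echo "exact"
      ;;
    *)
      # Strip whitespace and punctuation, then lowercase, before comparing.
      normalized=$(printf '%s' "$reason" | tr -d '[:space:][:punct:]' | tr '[:upper:]' '[:lower:]')
      case "$normalized" in
        actionreboot | actionreplace) echo "near-miss" ;;
        *) echo "unrelated" ;;
      esac
      ;;
  esac
}

# Usage:
#   check_action_reason "Action: Replace"
```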
--auto-resume=1 is an srun step option. It re-runs the step after HMA (the Health
Monitoring Agent) flags a node and Automatic node recovery replaces it.
Why it didn't restart the job:
- sbatch not srun: per-step; sbatch directives are silently ignored.
- NodeRecovery is None: faulty nodes are labeled but not replaced.

Link: https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-resiliency-slurm-auto-resume.html
Context: references/slurm-details.md § HyperPod auto-resume.
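The first cause can be spotted by scanning the user's batch script for where `--auto-resume` appears. The function name `auto_resume_placement` and its labels are hypothetical. The sketch matches single-line commands only, so an `srun` invocation split across lines with backslashes is not recognized.

```shell
# Reports where --auto-resume appears in a batch script:
#   "srun"    : at least one srun line carries --auto-resume=1
#   "sbatch"  : it appears only in an #SBATCH directive (silently ignored)
#   "missing" : it does not appear in either place
auto_resume_placement() {
  local script="$1"
  if grep -Eq '^[[:space:]]*srun[[:space:]].*--auto-resume=1' "$script"; then
    echo "srun"
  elif grep -Eq '^[[:space:]]*#SBATCH[[:space:]].*--auto-resume' "$script"; then
    echo "sbatch"
  else
    echo "missing"
  fi
}

# Usage:
#   auto_resume_placement <PATH_TO_BATCH_SCRIPT>
```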
| Condition | Next skill |
|---|---|
| Node returns to down shortly after a manual resume | hyperpod-node-debugger (hardware) |
| slurmd logs contain CUDA / NVIDIA / XID errors | hyperpod-node-debugger § G |
| Disk full or /dev/shm exhausted | hyperpod-node-debugger § I |
| Node unreachable via SSM | hyperpod-ssm |
| Controller restart does not clear COMPLETING after 2 attempts | hyperpod-issue-report + AWS Support |
npx claudepluginhub awslabs/agent-plugins --plugin sagemaker-ai