From sagemaker-ai
Diagnoses per-node issues on AWS HyperPod clusters (EKS or Slurm): unhealthy, unresponsive, stuck nodes. Covers EFA, GPU hardware (XID, ECC, NVLink, DCGM), Slurm node state, disk/memory pressure, lifecycle scripts, SSM agent, container runtime, kernel panics, pod networking. Read-only triage with suggested remediation commands.
How this skill is triggered — by the user, by Claude, or both
Slash command
/sagemaker-ai:hyperpod-node-debuggerThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
**Operating policy.** Run read-only diagnostics yourself. Never run a command that changes cluster, node, or workload state — present each one as a **Suggested command (run this yourself)** block and wait for the customer. Destructive order: **investigate → reboot → replace** (replace destroys root + secondary volumes; not supported on Slurm controller nodes). Never discard training state, logs...
Operating policy. Run read-only diagnostics yourself. Never run a command that changes cluster, node, or workload state — present each one as a Suggested command (run this yourself) block and wait for the customer. Destructive order: investigate → reboot → replace (replace destroys root + secondary volumes; not supported on Slurm controller nodes). Never discard training state, logs, or caches on speculation.
IaC note (always include with mutation commands). When you suggest any command that changes cluster, VPC, SG, subnet, or EKS configuration (e.g. authorize-security-group-*, modify-vpc-attribute, update-cluster, kubectl label/cordon/drain, create namespace, set env daemonset), ask the customer first whether the cluster / VPC / SG is managed by Infrastructure-as-Code (CloudFormation, CDK, Terraform, Pulumi). If yes, tell them: "Apply this change in your IaC source first, then deploy through the pipeline — running the command directly will drift from your template and the next stack update may overwrite it." If they need to fix the issue immediately and the IaC change will follow, flag the drift explicitly so they remember to reconcile.
Read-only triage. scripts/triage-cluster.sh (and helpers check-efa-sg.sh, check-node-reachability.sh, check-vpc-config.sh) read state and print each issue as [FAIL] ... → references/node-diagnostics-detail.md § <section>. Catalog of customer-ticket patterns: references/node-issue-catalog.md.
scripts/triage-cluster.sh (add --node <INSTANCE-ID> to focus one node).[FAIL] / issue entry, Read the referenced section.bash scripts/triage-cluster.sh --cluster <CLUSTER_NAME_OR_ARN> --region <REGION>
# Focus on one node:
bash scripts/triage-cluster.sh --cluster <CLUSTER_NAME_OR_ARN> --region <REGION> --node <INSTANCE_ID>
One pass collects: cluster status + NodeRecovery, events, per-node health (HyperPod + EKS labels, Slurm states), VPC/SG snapshot, CloudWatch availability, SSM readiness, on-node resource checks (disk, memory, /dev/shm, OOM, NVMe, time sync, SSM agent), Slurm node→instance mapping.
Tags: [PASS] passed · [FAIL] issue with a → references/... pointer · [WARN] advisory · [INFO] informational. Priorities: P0 blocks operation · P1 degraded · P2 informational.
Events (list-cluster-events) — provisioning-time:
| Event | Section |
|---|---|
"EFA health checks did not run successfully" (public-doc verbatim signal) | A: EFA/SG |
| Instance bootstrap or network-misconfiguration event | A + B: VPC |
| Lifecycle-script failure or timeout | D: Lifecycle |
| Insufficient-capacity or AZ-mismatch failure at creation | C: Capacity |
Hardware failure / UnschedulablePendingReplacement | F: Hardware |
EKS labels:
| Label | Section |
|---|---|
node-health-status: UnschedulablePendingReplacement | F |
node-health-status: UnschedulablePendingReboot | F |
deep-health-check-status: Failed | G → F |
Symptoms:
| Symptom | Section |
|---|---|
| Training hangs at NCCL init / AllReduce | A → E |
Slurm node down / "Node unexpectedly rebooted" | H: Slurm |
| Jobs stuck PENDING / COMPLETING | H |
| Auto-repair not triggering | F |
| GPU not visible / XID / ECC errors | G |
| GPU row-remap pending/failed / silent NaNs / DCGM Fail | G § G.1.a/b |
Disk full / OOM / "Cannot allocate memory" | I: Resources |
| Wrong vCPU count (e.g. 96 instead of 192 on p5.48xlarge) | J: Config |
| Container CrashLoopBackOff / runtime crash | M: Container Runtime |
aws-node CrashLoopBackOff / gRPC 50051 refused | O: CNI / Pod Networking |
| Pods stuck Pending with no IP / CNI error | O |
DNS resolution / enableDnsSupport | B § B.2 |
| Public subnet / IGW misconfigured | B § B.3 |
| Missing VPC endpoints (ECR / STS / FSx) | B § B.4 |
| EKS VPC / SG mismatch with HyperPod | B § B.5 |
| Kernel panic / watchdog / hung task | N: Kernel |
| Need shell on a node | K: SSM |
| Collect logs for AWS Support | L: Log Collection |
Per the HyperPod prerequisites doc, the SG must allow all inbound and outbound to itself. scripts/check-efa-sg.sh validates self-ref rules on every cluster SG. On-node EFA check via scripts/check-node-reachability.sh over SSM. Full: § A.
SG/subnet VPC mismatch, missing S3 Gateway endpoint, EKS auth mode, worker→controller routing, VPC DNS support, private-subnet + NAT / VPC endpoints, EKS↔HyperPod VPC alignment. scripts/check-vpc-config.sh. Full: § B.
Insufficient-capacity failure at creation, or no subnets in the AZ where capacity is available. Check AZ offerings via describe-instance-type-offerings, then change subnet AZ or use Flexible Training Plans / ODCR. Full: § C.
Surfaced in cluster events + CloudWatch under LifecycleConfig/<group>/<instance-id>. Common: S3 connectivity, IAM gaps, CRLF line endings, infinite loops, parameter-name mismatch. Full: § D.
Delegate to hyperpod-version-checker to compare NVIDIA driver, CUDA, NCCL, EFA installer, OFI NCCL, PyTorch across nodes. Ensure job env has FI_PROVIDER=efa, FI_EFA_USE_DEVICE_RDMA=1, NCCL_SOCKET_IFNAME=^lo,docker. Full: § E.
Confirm NodeRecovery=Automatic, inspect the EKS health labels + sagemaker.amazonaws.com/fault-details annotation, and read the SagemakerHealthMonitoringAgent/<group>/<instance> CloudWatch stream. HMA runs passive background checks on GPU and Neuron state and reboots the node on count mismatch (per the HMA doc: "if there's a mismatch between the expected number of GPUs … and the count returned by nvidia-smi, then HMA reboots the node"; same for neuron-ls). Manual recovery order: reboot first, replace only if reboot fails; the preferred path is the batch APIs (BatchReboot/BatchReplaceClusterNodes). Full: § F · patterns: node-issue-catalog.md.
NVIDIA (p4d/p5/g5/g6): nvidia-smi + dmesg over SSM for Xid, ECC, thermal throttling. Xid classification per NVIDIA's catalog: 13 Graphics Engine Exception (application-level), 31 GPU memory page fault (application, can be driver/HW), 63 GPU memory remapping event (HW/ECC), 71 CE4 Error (HW copy engine), 74 NVLink Error (HW), 79 GPU has fallen off the bus (PCIe bus), 109 Context Switch Timeout Error (HW). Any uncorrectable ECC → drain and replace. Row-remap state is the authoritative silent-degradation signal (§ G.1.a).
Trainium / Inferentia (trn1/trn2/inf2): Neuron SDK — neuron-ls, neuron-top, neuron-monitor. nvidia-smi does not apply.
GPU / accelerator failures flow into § F for reboot / replace. Full: § G.
Node down/unresponsive, unexpected reboots, stuck PENDING/COMPLETING jobs, Slurm-to-instance-ID translation. Primary access is SSM; diagnose slurmd first, fix the root cause, then start/resume the node per § H. Full: § H.
Disk full (HyperPod root volume defaults to 100 GB and is not intended to grow post-creation), OOM, os.fork() memory error, /dev/shm exhaustion, inode exhaustion. Fork-memory fix: export FI_EFA_USE_HUGE_PAGE=0. Redirect bulk data to /opt/sagemaker (secondary EBS) or /opt/dlami/nvme (instance store). Full: § I.
p5.48xlarge reports 96 vCPU instead of 192 → set ThreadsPerCore=2 via update-cluster. Full: § J.
No direct SSH on HyperPod. Target format sagemaker-cluster:<CLUSTER_ID>_<GROUP>-<INSTANCE_ID>. Failures: plugin missing, wrong prefix, IAM, VPC endpoints. Full: § K.
Delegate to hyperpod-issue-report for S3-stored bundles. Key CloudWatch streams: LifecycleConfig/<group>/<instance-id>, SagemakerHealthMonitoringAgent/<group>/<instance-id>. Full: § L.
CrashLoopBackOff, OOMKilled, ImagePullBackOff, RunContainerError on EKS. kubectl describe pod + on-node crictl ps -a, journalctl -u containerd. Full: § M.
Kernel panic, watchdog timeout, soft lockup, unexpected reboots not explained by HyperPod health monitoring. dmesg | grep -iE 'panic|watchdog|hung_task|NMI' + journalctl -b -1. nvrm-related signatures point at NVIDIA driver crashes. Full: § N.
VPC CNI (aws-node) failures, IPAMD errors, gRPC 127.0.0.1:50051 refused, pods stuck Pending with FailedCreatePodSandBox. Script auto-checks aws-node, kube-proxy, CoreDNS. Full: § O.
aws CLI v2, recent enough to support the HyperPod cluster commands (describe-cluster, list-cluster-nodes, batch-reboot-cluster-nodes, batch-replace-cluster-nodes)python3, bash 4+ (associative arrays are required by the scripts)kubectl authenticated to the EKS cluster (K8s checks skipped if absent)session-manager-plugin for on-node hardware checksunbuffer (from the expect package) — optional; if missing, SSM on-node probes are skipped while the rest of the triage still runs. Install via yum install expect / apt install expect.--region or set $AWS_DEFAULT_REGION.--node <ID> focuses one.--no-color to force off.| Failure | Script | Tell the customer |
|---|---|---|
aws sts get-caller-identity fails | Exit 1 | "Fix AWS credentials and rerun." |
describe-cluster fails | Exit 1 after listing region's clusters | "Confirm cluster name and region." |
sagemaker:* / ec2:* / logs:* AccessDenied | Warn, add Missing IAM permission for <API>, continue | "Grant the listed IAM action and rerun." |
kubectl absent or unauthenticated | Skip K8s checks | "Install/authenticate kubectl (see § K)." |
session-manager-plugin absent | Skip on-node probes | "Install session-manager-plugin (see § K)." |
SSM start-session fails or times out (180s) | Mark node unreachable with → § K pointer | "Rerun with --node <ID> to isolate; verify SSM agent on the node." |
| Cluster > 20,000 nodes | First 20,000 paginated; warn | "Use --node to target specific nodes." |
Exit codes: 0 triage complete · 1 cluster not found or fatal prerequisite missing.
Read-only diagnostic — covers triage-cluster.sh, check-efa-sg.sh, check-vpc-config.sh, and check-node-reachability.sh:
{
"Action": [
"sagemaker:DescribeCluster",
"sagemaker:DescribeClusterNode",
"sagemaker:ListClusterNodes",
"sagemaker:ListClusterEvents",
"sagemaker:ListClusters",
"eks:DescribeCluster",
"ec2:DescribeSecurityGroups",
"ec2:DescribeSubnets",
"ec2:DescribeVpcs",
"ec2:DescribeVpcAttribute",
"ec2:DescribeVpcEndpoints",
"ec2:DescribeRouteTables",
"ec2:DescribeNetworkInterfaces",
"ec2:DescribeInstances",
"ec2:DescribeInstanceTypeOfferings",
"ec2:DescribeInstanceTypes",
"logs:DescribeLogGroups",
"logs:DescribeLogStreams",
"logs:FilterLogEvents",
"ssm:StartSession",
"ssm:TerminateSession",
"service-quotas:GetServiceQuota"
]
}
sts:GetCallerIdentity is implicit — it requires no IAM action. SSM on HyperPod uses start-session against sagemaker-cluster:<cluster-id>_<group>-<iid> targets — not send-command against bare instance IDs. For remediation commands, grant the matching write permission (e.g. ec2:AuthorizeSecurityGroupIngress / Egress, ec2:RevokeSecurityGroupIngress / Egress, ec2:ModifyVpcAttribute, sagemaker:UpdateCluster, sagemaker:BatchRebootClusterNodes, sagemaker:BatchReplaceClusterNodes). Not needed for the diagnostic itself.
| Need | Use |
|---|---|
| Cluster creation / deployment failures | hyperpod-cluster-debugger (§ A / B / C / H + --validate) |
| Cluster-wide SSM outage | hyperpod-cluster-debugger § F |
| Single-node SSM failure | stay here — § K |
| Cluster-wide EFA health-check failure at creation time | hyperpod-cluster-debugger § A |
| Single-node EFA failure post-provisioning | stay here — § A |
| NCCL AllReduce / collective-op timeouts (distributed) | hyperpod-nccl |
| Silent GPU NaNs on a specific node (row-remap / DCGM) | stay here — § G.1 (even if discovered by NCCL) |
| Post-deployment cluster-wide management | hyperpod-cluster-debugger |
| Shell / commands on nodes | hyperpod-ssm |
| CUDA / NCCL / EFA version comparison | hyperpod-version-checker |
| Diagnostic bundle for AWS Support | hyperpod-issue-report |
| Training performance / MFU degradation | hyperpod-mfu-debugger |
Escalate when:
# 1. Cluster identity + affected node status
aws sagemaker describe-cluster --cluster-name <CLUSTER> --region <REGION>
aws sagemaker list-cluster-nodes --cluster-name <CLUSTER> --region <REGION> \
--query "ClusterNodeSummaries[?InstanceId=='<INSTANCE_ID>']"
# 2. Triage bundle (scoped to the affected node where possible)
bash scripts/triage-cluster.sh --cluster <CLUSTER> --region <REGION> --node <INSTANCE_ID> > triage.txt
# 3. Per-node log/config bundle to S3 (delegates to hyperpod-issue-report)
# See skills/hyperpod-issue-report/SKILL.md for the exact invocation.
triage.txt from step 2 abovehyperpod-issue-report bundle from step 3Patterns from real customer tickets: node-issue-catalog.md.
npx claudepluginhub awslabs/agent-plugins --plugin sagemaker-aiDiagnoses and remediates AWS HyperPod cluster issues (EKS/Slurm) including creation failures, EFA health, lifecycle scripts, capacity, EKS access, node replacement, CloudFormation errors, and autoscaler conflicts. Includes pre-flight validation.
Diagnoses and fixes Kubernetes pod failures like CrashLoopBackOff, Pending, DNS, networking, storage mounts, and rollout issues using kubectl workflows and scripts.
Diagnoses Kubernetes cluster health across pods, deployments, nodes, and events, then performs user-confirmed security fixes. Designed for SREs and on-call engineers.