From sagemaker-ai
Diagnoses NCCL failures and training-pod issues on HyperPod GPU clusters (EKS or Slurm). Detects training hangs, collective-op timeouts, EFA errors, OOMs, and pod scheduling problems via AWS/kubectl/SSM probes.
How this skill is triggered — by the user, by Claude, or both
Slash command
/sagemaker-ai:hyperpod-ncclThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
**Operating policy.** Run read-only diagnostics yourself. Never run a command that changes cluster, node, or workload state — present each one as a **Suggested command (run this yourself)** block and wait for the customer. Destructive order: **investigate → reboot → replace** (replace destroys root + secondary volumes; not supported on Slurm controller nodes). Never discard training state on sp...
Operating policy. Run read-only diagnostics yourself. Never run a command that changes cluster, node, or workload state — present each one as a Suggested command (run this yourself) block and wait for the customer. Destructive order: investigate → reboot → replace (replace destroys root + secondary volumes; not supported on Slurm controller nodes). Never discard training state on speculation.
Diagnose NCCL failures on SageMaker HyperPod (EKS and Slurm). scripts/nccl-diagnose.sh reads state via AWS APIs, kubectl, and SSM, then prints each issue as [FAIL] ... → references/<file>.md § <section>. Read-only.
Signal sourcing: list-cluster-events carries infrastructure-level state only (lifecycle, bootstrap, EFA health check, capacity, replacement, reboot, AMI rollback). It does not carry NCCL timeouts, GPU XID/ECC, or per-pod training signals — those come from pod logs, CloudWatch training streams, on-node SSM probes, and NCCL env audit. "No events" on a training-time NCCL issue is expected, not a clean bill of health.
[FAIL] line, Read the referenced section.If a finding has no matching section, report it as a bug — do not invent a fix.
EKS_ARN=$(aws sagemaker describe-cluster --cluster-name <HYPERPOD-NAME> --region <REGION> \
--query 'Orchestrator.Eks.ClusterArn' --output text)
EKS_NAME=$(echo "$EKS_ARN" | awk -F'/' '{print $NF}')
aws eks update-kubeconfig --name "$EKS_NAME" --region <REGION>
kubectl get nodes
# Basic:
bash scripts/nccl-diagnose.sh --cluster <HYPERPOD-NAME> --region <REGION>
# Scope to an EKS job/namespace:
bash scripts/nccl-diagnose.sh --cluster <NAME> --region <REGION> --namespace <NS> --job <JOB>
# Force orchestrator:
bash scripts/nccl-diagnose.sh --cluster <NAME> --region <REGION> --orchestrator slurm
# Larger hardware sample (default 3):
bash scripts/nccl-diagnose.sh --cluster <NAME> --region <REGION> --sample-nodes 10
# Specific node only:
bash scripts/nccl-diagnose.sh --cluster <NAME> --region <REGION> --node i-0abc123def456
Tags: [PASS] · [FAIL] (counted in Issues Found, has reference pointer) · [WARN] · [INFO]. Priorities: P0 blocks training · P1 degraded · P2 informational.
Each [FAIL] line in the script already points directly at the right section. This table is a lookup for manual triage.
| Finding | Section |
|---|---|
| SG missing inbound/outbound self-reference | operations.md § 8 |
| Blocking NetworkPolicy / allow-all missing | operations.md § 8 |
| Slurm node DOWN / DRAINING / RemoveIPC | operations.md § 7 |
| GPU XID / SYSTEM_ERROR / hardware fault | hyperpod-node-debugger § F / § G |
| GPU row-remap / DCGM Fail / silent NaNs | hyperpod-node-debugger § G.1.a/b |
| NCCL timeout / rendezvous / straggler | debugging-guide.md § 1 |
| EFA configuration / not used | debugging-guide.md § 6 |
EFA TCP fallback (NET/OFI Using TCP) | debugging-guide.md § 13 |
| NCCL version mismatch across pods | debugging-guide.md § 10 |
| Container OOM (pod killed, exit 137) | debugging-guide.md § 4 |
GPU OOM (CUDA out of memory) | debugging-guide.md § 11 |
RDMA memlock / /dev/shm too small | debugging-guide.md § 17 |
| MASTER_ADDR DNS / headless Service | debugging-guide.md § 12 |
| NVLS / PXN / topology tuning | debugging-guide.md § 19 |
| Any NCCL / EFA / rendezvous log pattern | error-patterns-quick-ref.md |
| Performance / nccl-tests / bandwidth | performance-testing.md |
aws CLI v2.13+ authenticated (aws sts get-caller-identity)jq, python3, bash 4.2+unbuffer (from the expect package: yum install expect / apt install expect)kubectl authenticated to the EKS cluster (K8s checks skipped if absent)session-manager-plugin for on-node hardware checks--region or set $AWS_DEFAULT_REGION.--orchestrator eks|slurm.--namespace <NS> --job <JOB>.--node <ID> for a specific node. Node probes run serially (180 s per node): --sample-nodes 10 can take ~30 min.TERM=dumb.| Failure | Script | Tell the customer |
|---|---|---|
aws sts get-caller-identity fails | Exit 1 with the AWS error | "Fix AWS credentials and rerun." |
describe-cluster AccessDenied | Warn, add Missing IAM for sagemaker:DescribeCluster | "Grant sagemaker:DescribeCluster (operations.md § 2)." |
| Cluster not found | Exit 1 after listing region's clusters | "Confirm HyperPod cluster name and region." |
kubectl absent / unauthenticated | Warn, skip K8s checks | "aws eks update-kubeconfig --name <EKS> --region <R>." |
| SSM plugin absent | Warn, skip on-node hardware checks | "Install session-manager-plugin." |
| SSM times out (180s) | Partial output, mark node unreachable | "Rerun with --node <ID> --sample-nodes 1; check SSM agent on the node." |
| CloudWatch log group not found | Skip CloudWatch scan | "Enable CloudWatch on the cluster (operations.md § 4)." |
| Cluster events API throttled | Warn, continue with partial data | "Rerun later — script is idempotent." |
Exit codes: 0 diagnostic complete · 1 fatal prerequisite missing or cluster unreachable.
Full policy + RBAC in operations.md § 2. SSM on HyperPod uses start-session against sagemaker-cluster:<cluster-id>_<group>-<iid> targets — grant ssm:StartSession / ssm:TerminateSession, not ssm:SendCommand.
| Scope | Method | Coverage |
|---|---|---|
| All nodes | sagemaker:ListClusterNodes (paginated) | 100% nodes |
| All K8s objects | kubectl | 100% pods/nodes/policies |
| Hardware | SSM --sample-nodes N (default 3) | Sampled |
| Node logs | CloudWatch | 100% nodes |
Large clusters: the PyTorch NCCL backend defaults to a 10-minute collective-op timeout (per the PyTorch distributed docs). Large clusters routinely exceed that on first rendezvous; raise it via torch.distributed.init_process_group(timeout=timedelta(seconds=<N>)). HyperPod support has also observed NCCL topology-graph-search hangs on 256+ node clusters when memlock is unlimited; using a large fixed memlock (e.g. 8388608) in pod securityContext or /etc/security/limits.conf has cleared these in field cases. This memlock pattern is a field observation, not AWS- or NCCL-documented behavior.
For FSDP, DeepSpeed, or Megatron-LM tuning: debugging-guide.md § 18.
| Need | Use |
|---|---|
| Cluster creation / deployment failures | hyperpod-cluster-debugger (§ A / B / C / H + --validate) |
| Post-deployment cluster-wide management | hyperpod-cluster-debugger |
| Per-node issues (disk, lifecycle, hardware) | hyperpod-node-debugger |
| Trainium/Inferentia collective-comm (AWS Neuron Collectives, not NCCL) | hyperpod-node-debugger § G.2 |
| Shell on nodes | hyperpod-ssm |
| Version comparison across nodes | hyperpod-version-checker |
| Diagnostic bundle for AWS Support | hyperpod-issue-report |
| MFU / performance degradation | hyperpod-mfu-debugger |
Escalate when:
Issues Found: 0 but training still fails.# 1. Cluster identity + status
aws sagemaker describe-cluster --cluster-name <C> --region <R>
# 2. Full NCCL diagnostic (sample more nodes for escalation)
bash scripts/nccl-diagnose.sh --cluster <C> --region <R> --sample-nodes 10 > nccl-diag.txt
# 3. Per-node log/config bundle to S3 (delegates to hyperpod-issue-report)
# See skills/hyperpod-issue-report/SKILL.md for the exact invocation.
nccl-diag.txt from step 2 abovehyperpod-issue-report bundle from step 3printenv | grep -E '^NCCL|^FI_|^TORCH_' from one pod)npx claudepluginhub awslabs/agent-plugins --plugin sagemaker-aiDiagnoses per-node issues on AWS HyperPod clusters (EKS or Slurm): unhealthy, unresponsive, stuck nodes. Covers EFA, GPU hardware (XID, ECC, NVLink, DCGM), Slurm node state, disk/memory pressure, lifecycle scripts, SSM agent, container runtime, kernel panics, pod networking. Read-only triage with suggested remediation commands.
Diagnoses and fixes CoreWeave Kubernetes errors: GPU scheduling failures, pending pods, CUDA OOM, NCCL timeouts, PVC mounting, image pull issues.
Collects, cleans, and diagnoses NPU faults on Ascend hardware. Supports cluster/single-node/super-node scenarios for training/inference failures or performance degradation using log_collector.py.