From sagemaker-ai
Diagnoses and remediates AWS HyperPod cluster issues (EKS/Slurm) including creation failures, EFA health, lifecycle scripts, capacity, EKS access, node replacement, CloudFormation errors, and autoscaler conflicts. Includes pre-flight validation.
How this skill is triggered — by the user, by Claude, or both
Slash command
/sagemaker-ai:hyperpod-cluster-debuggerThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
**Operating policy.** Run read-only diagnostics yourself. Never run a command that changes cluster, node, or workload state — present each one as a **Suggested command (run this yourself)** block and wait for the customer to run it. Destructive order: **investigate → reboot → replace** (replace destroys root + secondary volumes; not supported on Slurm controller nodes).
Operating policy. Run read-only diagnostics yourself. Never run a command that changes cluster, node, or workload state — present each one as a Suggested command (run this yourself) block and wait for the customer to run it. Destructive order: investigate → reboot → replace (replace destroys root + secondary volumes; not supported on Slurm controller nodes).
Before any state-changing CLI: ask if it's IaC-managed. HyperPod clusters, SGs, EKS access entries, and IAM are usually provisioned via CloudFormation / CDK / Terraform. If yes, the fix belongs in IaC — running the CLI will drift and the next deploy reverts it. Use the CLI only when IaC is unavailable (locked out, predates IaC, mid-review).
scripts/diagnose-cluster.sh is read-only: it collects state via AWS APIs (and SSM for Slurm controller health) and prints each issue as [FAIL] ... → references/<file>.md § <section>.
| Reference | Open when |
|---|---|
| cluster-diagnostics-detail.md | Per-finding remediation runbook (§ A–L) |
| cluster-operations.md | Operational deep-dives (EFA SG, EKS access, SSM, Slurm, filesystem) |
| cloudformation-errors.md | § H needs the full per-resource CFN error catalog |
| capacity-planning.md | § B or --validate flags capacity / subnet sizing |
| lifecycle-scripts.md | § C points at a specific lifecycle failure |
| iam-permissions.md | Full IAM policy for the diagnostic |
scripts/diagnose-cluster.sh (or --validate for pre-create).[FAIL] line, Read the referenced section.# Diagnose an existing cluster:
bash scripts/diagnose-cluster.sh --cluster <CLUSTER_NAME_OR_ARN> --region <REGION>
# Pre-flight (no cluster needed) — validates SGs, subnets, IAM, VPC endpoints,
# optionally S3 lifecycle scripts and per-AZ capacity:
bash scripts/diagnose-cluster.sh --validate --region <REGION> \
--sg-ids <sg-1,sg-2> --subnet-ids <sub-1,sub-2> [--iam-role <role-arn>] \
[--s3-uri s3://<BUCKET>/path/] [--instance-type ml.p5.48xlarge]
Pass --instance-type when the target instance type is known — enables the per-AZ capacity check (warns if none of the provided subnets are in an AZ that offers that type, which causes insufficient-capacity failures at creation time).
Tags: [PASS] · [FAIL] (counted, has → references/... pointer) · [WARN] · [INFO]. Priorities: P0 blocks operation · P1 degraded · P2 informational.
Error messages / events:
| Signal | Section |
|---|---|
"EFA health checks did not run successfully" (public-doc verbatim signal) | A: EFA Health Checks |
| Insufficient-capacity or AZ-mismatch failure at creation | B: Capacity & AZ |
| Lifecycle-script failure or timeout during provisioning | C: Lifecycle Scripts |
| kubectl auth error (server asks for credentials / no API group list) | D: EKS Access |
InService but not all instances visible | E: Cluster Provisioning |
"Target is not connected" / SSM errors | F: SSM Connectivity |
Node replacement not happening / batch-replace not working | G: Node Replacement |
"Embedded stack failed" / any CloudFormation error | H: CloudFormation Errors |
UpdateClusterSoftware failed or cluster in post-maintenance rollback state | J: AMI & Cluster Updates |
Dangling / orphaned nodes in EKS vs list-cluster-nodes | K: Dangling Nodes & Cleanup |
| Cluster Autoscaler breaks after HyperPod attached | L: Autoscaler Compatibility |
| Slow I/O, FSx throughput saturated | cluster-operations.md § 9 |
| Slurm node name → instance ID lookup | I: Utilities |
SG missing self-reference. Add inbound + outbound self-ref to every SG on the cluster, plus least-privilege egress for the AWS APIs the node needs (HTTPS 443 to S3 / ECR / SageMaker / SSM / STS / CloudWatch Logs — via VPC-endpoint prefix-lists when possible). Full procedure: cluster-diagnostics-detail.md § A.
Instance type unavailable in the requested AZ. Verify with describe-instance-type-offerings, then change AZ, use Flexible Training Plans, or request ODCR. Full: § B · strategy: capacity-planning.md.
Script failed or timed out during provisioning. Read CloudWatch under /aws/sagemaker/Clusters/<name>/<id> — common causes: missing S3 VPC endpoint, IAM gap, CRLF line endings, instance-group name mismatch. Full: § C · layout: lifecycle-scripts.md.
IAM identity not in EKS access entries. Verify with sts get-caller-identity, create an access entry with admin policy, update kubeconfig. Full: § D.
InService without all instances is expected under Continuous Provisioning — failures surface as events, not cluster errors. For stuck Creating/Updating/Deleting: check CFN nested stacks (§ H), IAM, capacity, events; if stuck Deleting check VPC ENI dependencies. Full: § E.
Target is not connected: use sagemaker-cluster:<CLUSTER_ID>_<GROUP>-<INSTANCE_ID> format (not raw EC2 ID), install session-manager-plugin, confirm node Running. Check IAM + VPC endpoints on timeouts. Full: § F.
Auto-repair: confirm NodeRecovery=Automatic, check Health Monitoring Agent (HMA) logs + node labels / Slurm reason, confirm capacity. Manual: reboot first, replace only if reboot fails. Replace requires the cluster to have been patched via UpdateClusterSoftware at least once and cannot target a Slurm controller node. Full: § G.
Embedded stack failed hides the real error. Drill into nested stacks via Events tab (filter Failed) until you reach a non-stack resource. CLI: describe-stack-events --query 'StackEvents[?ResourceStatus==\CREATE_FAILED`]'`. Also covers SLR creation failures and permission-boundary denials. Full: § H · catalog: cloudformation-errors.md.
Map Slurm node names (ip-10-x-y-z) to HyperPod instance IDs via list-cluster-nodes or on-node /opt/ml/config/resource_config.json. Full: § I.
UpdateClusterSoftware fails and rolls back, or the cluster stays in a post-maintenance rollback state. Common causes: lifecycle script incompatible with new AMI, HMA version too old, insufficient rolling-update capacity. If the cluster has active nodes, collect diagnostics and escalate rather than delete-and-recreate. Full: § J.
Nodes in kubectl get nodes but not in list-cluster-nodes (ghost EKS nodes), or the inverse (HyperPod nodes that never registered kubelet). Script flags both. Full: § K.
Cluster Autoscaler errors on HyperPod provider IDs and breaks autoscaling for all node groups. No officially endorsed workaround — escalate to AWS Support. Karpenter does not conflict with HyperPod nodes by default. Full: § L.
aws CLI v2.13+ authenticated to the cluster's accountjq, python3, bash 4.2+kubectl authenticated to the EKS cluster (EKS checks skipped if absent)session-manager-plugin (Slurm controller health checks only)IAM policy: references/iam-permissions.md.
--region or set $AWS_DEFAULT_REGION.--cluster <NAME> (diagnose) or --validate (pre-create).--no-color to force off.| Failure | Script | Tell the customer |
|---|---|---|
aws sts get-caller-identity fails | Exit 1 | "Fix AWS credentials and rerun." |
| Cluster not found | Exit 1 after listing region's clusters | "Confirm HyperPod cluster name (not EKS) and region." |
sagemaker:* / ec2:* / eks:* / logs:* denied | Warn, add Missing IAM permission for <API>, continue | "Grant the listed IAM action and rerun." |
kubectl absent or unauthenticated | Skip EKS checks (access entries, add-ons, aws-auth, nodes) | "Install/authenticate kubectl." |
session-manager-plugin absent (Slurm) | Skip Slurm controller probe | "Install session-manager-plugin." |
| SSM throttled / times out (180s) | Retry with backoff; warn and continue if still failing | "Rerun later — script is idempotent." |
| CloudWatch log group not found | Skip CloudWatch check | "CloudWatch not configured on this cluster." |
Exit codes: 0 no critical failures · 1 one or more critical failures (cluster not found, fatal prerequisite missing, or any [FAIL] in diagnose or --validate mode). [WARN] lines do not affect the exit code.
| Need | Use |
|---|---|
| Shell on nodes | hyperpod-ssm |
| Version comparison across nodes | hyperpod-version-checker |
Escalate when:
Creating, Updating, or a post-maintenance rollback state) for an extended period.Run these commands and attach the output. Goal: AWS Support has everything at case open.
# 1. Cluster identity + status (confirms region, ARN, orchestrator, instance groups)
aws sagemaker describe-cluster --cluster-name <CLUSTER> --region <REGION>
# 2. Full cluster-level diagnostic bundle
bash scripts/diagnose-cluster.sh --cluster <CLUSTER> --region <REGION> > diag.txt
# 3. Per-node log/config bundle to S3 (delegates to hyperpod-issue-report skill)
# See skills/hyperpod-issue-report/SKILL.md for the exact invocation.
ClusterId suffix) and AWS regionClusterStatus + FailureMessage from describe-clusterNodeLogicalIds / instance group namesdiag.txt from step 2 abovehyperpod-issue-report bundle from step 3npx claudepluginhub awslabs/agent-plugins --plugin sagemaker-aiDiagnoses per-node issues on AWS HyperPod clusters (EKS or Slurm): unhealthy, unresponsive, stuck nodes. Covers EFA, GPU hardware (XID, ECC, NVLink, DCGM), Slurm node state, disk/memory pressure, lifecycle scripts, SSM agent, container runtime, kernel panics, pod networking. Read-only triage with suggested remediation commands.
Diagnoses Kubernetes cluster health across pods, deployments, nodes, and events, then performs user-confirmed security fixes. Designed for SREs and on-call engineers.
Diagnoses and fixes Kubernetes pod failures like CrashLoopBackOff, Pending, DNS, networking, storage mounts, and rollout issues using kubectl workflows and scripts.