From sagemaker-ai
Collects diagnostic logs from HyperPod clusters (EKS and Slurm) via SSM for troubleshooting and AWS Support case preparation. Use when investigating node failures or performance issues.
Slash command
/sagemaker-ai:hyperpod-issue-report
Collect diagnostic logs from HyperPod cluster nodes via SSM and store the results in S3. Supports both EKS and Slurm clusters with auto-detection. Uses the bundled `scripts/hyperpod_issue_report.py` for reliable parallel collection.
Required IAM permissions: sagemaker:DescribeCluster, sagemaker:ListClusterNodes, ssm:StartSession, s3:PutObject, s3:GetObject, eks:DescribeCluster
s3:GetObject/s3:PutObject on the report bucket
Collect from the user:
Cluster name or ARN (e.g., arn:aws:sagemaker:us-west-2:123456789012:cluster/abc123)
S3 path for the report (e.g., s3://bucket/prefix). If the user doesn't have a bucket, create one (e.g., s3://hyperpod-diagnostics-<account-id>-<region>)
aws sts get-caller-identity
aws sagemaker describe-cluster --cluster-name <name-or-arn> --region <region>
If the S3 bucket doesn't exist, create it:
aws s3 mb s3://<bucket-name> --region <region>
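Before calling the CLI, it can help to validate the inputs collected from the user. The following is a minimal sketch, not part of the bundled script: the helper names (`parse_cluster_arn`, `split_s3_path`, `default_bucket_name`) are hypothetical, and the ARN pattern is an assumption based only on the example ARN shown above.

```python
import re
from typing import NamedTuple, Tuple


class ClusterArn(NamedTuple):
    region: str
    account_id: str
    cluster_id: str


# Assumed shape: arn:<partition>:sagemaker:<region>:<12-digit account>:cluster/<id>
_ARN_RE = re.compile(
    r"^arn:aws[\w-]*:sagemaker:(?P<region>[a-z0-9-]+):"
    r"(?P<account>\d{12}):cluster/(?P<cid>[\w-]+)$"
)


def parse_cluster_arn(arn: str) -> ClusterArn:
    """Split a HyperPod cluster ARN into region, account ID, and cluster ID."""
    match = _ARN_RE.match(arn)
    if match is None:
        raise ValueError(f"not a SageMaker cluster ARN: {arn!r}")
    return ClusterArn(match["region"], match["account"], match["cid"])


def split_s3_path(path: str) -> Tuple[str, str]:
    """Split s3://bucket[/prefix] into (bucket, prefix); prefix may be empty."""
    if not path.startswith("s3://"):
        raise ValueError(f"S3 path must start with s3://: {path!r}")
    bucket, _, prefix = path[len("s3://"):].partition("/")
    if not bucket:
        raise ValueError(f"S3 path has no bucket name: {path!r}")
    return bucket, prefix.strip("/")


def default_bucket_name(account_id: str, region: str) -> str:
    """Build the suggested bucket name hyperpod-diagnostics-<account-id>-<region>."""
    return f"hyperpod-diagnostics-{account_id}-{region}"
```

When the user supplies an ARN rather than a name, the region and account ID parsed from it could be used to fill in `--region` and the default bucket name without asking again.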
For EKS clusters (check Orchestrator.Eks in describe-cluster output):
Ensure kubectl is installed (which kubectl). If missing, install it for the current platform.
Configure kubeconfig using the EKS cluster name from the describe-cluster response:
aws eks update-kubeconfig --name <eks-cluster-name> --region <region>
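The orchestrator check above can be sketched in Python against the JSON that `aws sagemaker describe-cluster` returns. This is an illustration under stated assumptions, not the bundled script's logic: it assumes the EKS cluster ARN sits under `Orchestrator.Eks.ClusterArn`, and that a response without an `Eks` entry means a Slurm cluster. The function names are hypothetical.

```python
from typing import Any, Dict, Optional


def detect_orchestrator(describe_cluster: Dict[str, Any]) -> str:
    """Return "eks" if the response has Orchestrator.Eks, else "slurm" (assumed)."""
    eks = (describe_cluster.get("Orchestrator") or {}).get("Eks")
    return "eks" if eks else "slurm"


def eks_cluster_name(describe_cluster: Dict[str, Any]) -> Optional[str]:
    """Extract the EKS cluster name from Orchestrator.Eks.ClusterArn, if present."""
    eks = (describe_cluster.get("Orchestrator") or {}).get("Eks") or {}
    arn = eks.get("ClusterArn")
    if not arn:
        return None
    # An EKS cluster ARN ends in "cluster/<name>"; keep the part after the last "/".
    return arn.rsplit("/", 1)[-1]
```

The name returned by `eks_cluster_name` is the value to pass as `--name` to `aws eks update-kubeconfig`.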
uv run scripts/hyperpod_issue_report.py \
--cluster <cluster-name-or-arn> \
--region <region> \
--s3-path s3://<bucket>[/prefix]
Use --help for all options including --instance-groups, --nodes, --command, --max-workers, and --debug. Note: --instance-groups and --nodes are mutually exclusive. Node identifiers accept instance IDs (i-*), EKS names (hyperpod-i-*), or Slurm names (ip-*).
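The targeting rules above (mutually exclusive `--instance-groups` and `--nodes`, and three accepted node identifier forms) can be illustrated with a small sketch. It is not the script's actual argument parser: the `nargs="+"` value format, the parser name, and the `classify_node_identifier` helper are assumptions made for illustration. Check `--help` for the real syntax.

```python
import argparse


def classify_node_identifier(node: str) -> str:
    """Map a node identifier to its kind based on the prefixes listed above."""
    # Check "hyperpod-i-" first so EKS names are not confused with other forms.
    if node.startswith("hyperpod-i-"):
        return "eks-name"
    if node.startswith("i-"):
        return "instance-id"
    if node.startswith("ip-"):
        return "slurm-name"
    raise ValueError(f"unrecognized node identifier: {node!r}")


def build_parser() -> argparse.ArgumentParser:
    """Sketch of a parser where --instance-groups and --nodes cannot be combined."""
    parser = argparse.ArgumentParser(prog="hyperpod_issue_report_sketch")
    target = parser.add_mutually_exclusive_group()
    # Space-separated values are an assumption; the real script may differ.
    target.add_argument("--instance-groups", nargs="+")
    target.add_argument("--nodes", nargs="+")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    for node in args.nodes or []:
        print(node, classify_node_identifier(node))
```

With a mutually exclusive group, argparse rejects a command line that passes both options and exits with a usage error, which matches the behavior the note above describes.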
After collection, the script shows statistics and offers interactive download. Report the S3 location and offer to:
See references/troubleshooting.md for error handling, large cluster tuning, and known limitations.
npx claudepluginhub awslabs/agent-plugins --plugin sagemaker-ai