Executes remote commands and transfers files on SageMaker HyperPod cluster nodes via AWS Systems Manager (SSM). Required for any node-level shell access — direct SSH is unavailable.
Slash command: `/sagemaker-ai:hyperpod-ssm`
Prerequisites:

- **`aws` CLI v2**, authenticated for the target account/Region.
- **`session-manager-plugin`**, installed alongside the AWS CLI.
- **`jq`**: the scripts build JSON payloads with it.
- **`unbuffer`** (from the `expect` package): wraps `aws ssm start-session` with a PTY so the session-manager-plugin flushes stdout instead of racing to close. Without it, calls intermittently return empty output with `Cannot perform start session: EOF` even when the command ran. Install with `sudo yum install expect`, `sudo apt install expect`, or `brew install expect`. `ssm-exec.sh` detects and uses it automatically, and falls back with a warning if it is missing.

Target format: `sagemaker-cluster:<CLUSTER_ID>_<GROUP_NAME>-<INSTANCE_ID>`

- `CLUSTER_ID`: last segment of the cluster ARN (NOT the cluster name). Extract via `get-cluster-info.sh`.
- `GROUP_NAME`: instance group name. Retrieve via `list-nodes.sh`.
- `INSTANCE_ID`: EC2 instance ID (e.g., `i-0123456789abcdef0`).

Three scripts live under `scripts/`. Resolve cluster info and nodes once, then execute per node.
```shell
scripts/get-cluster-info.sh CLUSTER_NAME [--region REGION]
# Output: {"cluster_id":"...","cluster_arn":"...","cluster_name":"...","region":"..."}

scripts/list-nodes.sh CLUSTER_NAME [--region REGION] [--instance-group GROUP] [--instance-id ID]
# Output: JSON array of ClusterNodeSummaries (InstanceId, InstanceGroupName, InstanceStatus, etc.)
```
`list-cluster-nodes` paginates at 100 nodes; `list-nodes.sh` handles pagination automatically.
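The target format described earlier can be assembled from the fields these two scripts return. The following is a minimal sketch, assuming the cluster ARN ends in `/<CLUSTER_ID>`; the function names `cluster_id_from_arn` and `build_target` are hypothetical helpers for illustration and are not part of the skill's scripts.

```shell
# Hypothetical helpers (not part of scripts/): assemble an SSM target string.

# Print the last "/"-separated segment of a cluster ARN.
# Assumption: that segment is the cluster ID.
cluster_id_from_arn() {
  local arn="$1"
  printf '%s\n' "${arn##*/}"
}

# Print sagemaker-cluster:<CLUSTER_ID>_<GROUP_NAME>-<INSTANCE_ID>.
build_target() {
  local cluster_id="$1"
  local group="$2"
  local instance_id="$3"
  printf 'sagemaker-cluster:%s_%s-%s\n' "$cluster_id" "$group" "$instance_id"
}
```

Given `cluster_id` from `get-cluster-info.sh` and the `InstanceGroupName` and `InstanceId` fields from `list-nodes.sh`, `build_target "$cluster_id" "$group" "$instance_id"` would produce the string to pass as `--target`.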
```shell
# Execute with a pre-built target
scripts/ssm-exec.sh --target "sagemaker-cluster:CLUSTERID_GROUP-INSTANCEID" 'command' [--region REGION]

# Execute with target parts
scripts/ssm-exec.sh --cluster-id ID --group GROUP --instance-id INSTANCE_ID 'command' [--region REGION]

# Upload
scripts/ssm-exec.sh --target TARGET --upload LOCAL_PATH REMOTE_PATH [--region REGION]

# Read remote file
scripts/ssm-exec.sh --target TARGET --read REMOTE_PATH [--region REGION]
```
SSM start-session rate limit: 3 TPS per account. Plan batch size and delay accordingly.
aws ssm send-command does NOT support sagemaker-cluster: targets — only start-session works.
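One way to respect the 3 TPS limit when looping over many nodes is to pause between session starts. The sketch below is an illustration only: `run_throttled` is a hypothetical helper, and the delay needed in practice depends on what else in the account is starting sessions at the same time.

```shell
# Hypothetical helper: run a command once per target, pausing between calls
# so that session starts stay under the per-account rate limit.
# Usage: run_throttled DELAY_SECONDS COMMAND [ARGS...] < targets.txt
# Each non-empty input line is appended to COMMAND as its last argument.
run_throttled() {
  local delay="$1"
  shift
  local target
  while IFS= read -r target; do
    [ -n "$target" ] || continue   # skip blank lines
    "$@" "$target" < /dev/null     # keep the command from consuming the target list
    sleep "$delay"
  done
}
```

A delay of `0.4` keeps a single loop at no more than 2.5 starts per second. Because `ssm-exec.sh` expects the target after `--target`, the command passed in would typically be a small wrapper function that takes the target as `$1`. The loop does not stop when a call fails; callers that need that behavior would have to check the status themselves.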
When the scripts aren't suitable, use aws ssm start-session directly with AWS-StartNonInteractiveCommand. Wrap every invocation in unbuffer — without it, stdout is intermittently empty (see Prerequisites).
```shell
# Write the parameters to a file; the command array is passed as argv.
cat > /tmp/cmd.json << 'EOF'
{"command": ["bash -c 'echo hello && whoami'"]}
EOF

unbuffer aws ssm start-session \
  --target sagemaker-cluster:{CLUSTER_ID}_{GROUP_NAME}-{INSTANCE_ID} \
  --region REGION \
  --document-name AWS-StartNonInteractiveCommand \
  --parameters file:///tmp/cmd.json
```
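When calling `aws ssm start-session` by hand, the `unbuffer` wrapping can be made conditional so the call still runs on machines without the `expect` package. The sketch below mirrors the detect-and-warn behavior described for `ssm-exec.sh`, but it is not that script's code; `pty_prefix` is a hypothetical name.

```shell
# Hypothetical helper: print "unbuffer" when it is on PATH; otherwise warn on
# stderr and print nothing, so the caller falls back to a plain invocation.
pty_prefix() {
  if command -v unbuffer >/dev/null 2>&1; then
    printf 'unbuffer\n'
  else
    printf 'warning: unbuffer not found; output may be intermittently empty\n' >&2
  fi
}
```

A caller might then run `$(pty_prefix) aws ssm start-session ...` with the substitution left unquoted, so that an empty result simply disappears from the command line.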
- Pass `--parameters` via `file://`: inline parameters break with special characters.
- The `command` parameter is argv, not shell input. Wrap multi-statement scripts in `bash -c '...'` so pipes, semicolons, and redirects evaluate.

| Task | Command |
|---|---|
| Lifecycle logs | `cat /var/log/provision/provisioning.log` |
| Memory | `free -h` |
| Disk/mounts | `df -h && lsblk` |
| GPU status | `nvidia-smi` |
| GPU memory | `nvidia-smi --query-gpu=memory.used,memory.total --format=csv` |
| EFA/network | `fi_info -p efa` |
| CloudWatch agent | `sudo systemctl status amazon-cloudwatch-agent` |
| Top processes | `ps aux --sort=-%mem \| head -20` |
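Commands from the table that contain pipes or `&&` need the `bash -c '...'` wrapping and a parameters file. The following is a jq-free sketch of generating that file; `write_ssm_params` is a hypothetical helper, and it assumes a single-line script with no single quotes and no control characters such as tabs. The skill's own scripts use `jq`, which handles JSON escaping in general.

```shell
# Hypothetical helper: write a parameters file for AWS-StartNonInteractiveCommand.
# Usage: write_ssm_params 'ps aux --sort=-%mem | head -20' /tmp/cmd.json
# Assumptions: SCRIPT is one line, has no single quotes, and has no control
# characters; this sketch does not attempt nested-quote escaping.
write_ssm_params() {
  local script="$1"
  local outfile="$2"
  local nl='
'
  case "$script" in
    *"'"*)   echo "error: single quotes are not supported" >&2; return 1 ;;
    *"$nl"*) echo "error: multi-line scripts are not supported" >&2; return 1 ;;
  esac
  # JSON-escape backslashes first, then double quotes.
  local escaped
  escaped=$(printf '%s' "$script" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  # Wrap in bash -c '...' so pipes, semicolons, and redirects evaluate.
  printf "{\"command\": [\"bash -c '%s'\"]}\n" "$escaped" > "$outfile"
}
```

The resulting file can be passed as `--parameters file:///tmp/cmd.json`. Rejecting single quotes is a deliberately conservative choice, since this document does not describe how nested quoting inside the `command` string is parsed.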
- Sessions run as `root`.
- Omit `--document-name` to get a shell.
- For non-interactive commands, use `AWS-StartNonInteractiveCommand`.

Install the plugin with `npx claudepluginhub awslabs/agent-plugins --plugin sagemaker-ai`.