Skill

hyperpod-ssm

Executes remote commands and transfers files on SageMaker HyperPod cluster nodes via AWS Systems Manager (SSM). Required for any node-level shell access — direct SSH is unavailable.

AWS

devops

infrastructure

Popularity

Parent stars

748

Parent forks

102

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/sagemaker-ai:hyperpod-ssm

User invocable

Model invocable

Inline context

Default effort


Supporting Files

references/troubleshooting.md
scripts/get-cluster-info.sh
scripts/list-nodes.sh
scripts/ssm-exec.sh

SKILL.md

107 lines · ~1.3k tokens

Stats

Language: Shell

Parent stars: 748

Parent forks: 102

Maintenance: Excellent

Last commit: May 16, 2026


HyperPod SSM Access

Prerequisites

aws CLI v2, authenticated for the target account/Region.
session-manager-plugin — installed alongside the AWS CLI.
jq — the scripts build JSON payloads with it.
unbuffer (from the expect package) — wraps aws ssm start-session in a PTY so that session-manager-plugin flushes stdout before the session closes. Without it, calls intermittently return empty output with Cannot perform start session: EOF, even though the command ran. Install with sudo yum install expect, sudo apt install expect, or brew install expect. ssm-exec.sh detects and uses unbuffer automatically, and falls back with a warning if it is missing.
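
The detect-and-fall-back behavior described for unbuffer can be sketched as below. This is a minimal illustration, not the code in ssm-exec.sh: the function name pty_wrapper and the warning text are assumptions made for the example.

```shell
# Hypothetical helper (not taken from ssm-exec.sh): prints the wrapper
# command to put in front of "aws ssm start-session", or prints nothing
# and warns on stderr when unbuffer is not installed.
pty_wrapper() {
  if command -v unbuffer >/dev/null 2>&1; then
    printf '%s\n' unbuffer
  else
    echo "warning: unbuffer not found; output may be intermittently empty" >&2
  fi
}

# Usage (commented out because it needs AWS access):
# wrapper=$(pty_wrapper)
# $wrapper aws ssm start-session --target "$TARGET" --region "$REGION" ...
```

Leaving $wrapper unquoted in the usage line is deliberate: when unbuffer is missing, the empty value expands to nothing and the aws command runs unwrapped.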

SSM Target Format

Target: sagemaker-cluster:<CLUSTER_ID>_<GROUP_NAME>-<INSTANCE_ID>

CLUSTER_ID: Last segment of cluster ARN (NOT the cluster name). Extract via get-cluster-info.sh.
GROUP_NAME: Instance group name — retrieve via list-nodes.sh.
INSTANCE_ID: EC2 instance ID (e.g., i-0123456789abcdef0), also returned by list-nodes.sh.
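
As a rough sketch of how the three parts combine, the helper below derives CLUSTER_ID from a cluster ARN and assembles the target string. The function name build_target, the account number, and the cluster ID are made up for illustration; get-cluster-info.sh remains the documented way to resolve the ID.

```shell
# build_target ARN GROUP_NAME INSTANCE_ID   (illustrative helper name)
build_target() {
  local arn="$1" group="$2" instance_id="$3"
  # CLUSTER_ID is the text after the last "/" in the cluster ARN.
  local cluster_id="${arn##*/}"
  printf 'sagemaker-cluster:%s_%s-%s\n' "$cluster_id" "$group" "$instance_id"
}

# Example with made-up values:
build_target \
  "arn:aws:sagemaker:us-west-2:111122223333:cluster/abcd1234efgh" \
  "worker-group" "i-0123456789abcdef0"
# prints: sagemaker-cluster:abcd1234efgh_worker-group-i-0123456789abcdef0
```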

Scripts

Three scripts under scripts/. Resolve cluster info and nodes once, then execute per node.

get-cluster-info.sh — Resolve cluster name → ID (call once)

scripts/get-cluster-info.sh CLUSTER_NAME [--region REGION]
# Output: {"cluster_id":"...","cluster_arn":"...","cluster_name":"...","region":"..."}

list-nodes.sh — List all nodes with pagination (call once)

scripts/list-nodes.sh CLUSTER_NAME [--region REGION] [--instance-group GROUP] [--instance-id ID]
# Output: JSON array of ClusterNodeSummaries (InstanceId, InstanceGroupName, InstanceStatus, etc.)

list-cluster-nodes paginates at 100 nodes. This script handles pagination automatically.
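
One possible way to turn the list-nodes.sh output into SSM targets is a small jq filter. The sketch assumes the output is a JSON array whose elements carry InstanceGroupName and InstanceId, as described above; the function name nodes_to_targets and the sample values are hypothetical.

```shell
# nodes_to_targets CLUSTER_ID
# Reads the JSON array from list-nodes.sh on stdin and prints one
# sagemaker-cluster: target per node. Helper name is illustrative.
nodes_to_targets() {
  jq -r --arg cid "$1" \
    '.[] | "sagemaker-cluster:\($cid)_\(.InstanceGroupName)-\(.InstanceId)"'
}

# Usage (commented out because it needs AWS access):
# scripts/list-nodes.sh CLUSTER_NAME | nodes_to_targets "$CLUSTER_ID"
#
# With made-up input [{"InstanceId":"i-0aaa","InstanceGroupName":"workers"}]
# and cluster ID abcd1234efgh, it prints:
#   sagemaker-cluster:abcd1234efgh_workers-i-0aaa
```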

ssm-exec.sh — Execute command on a node (call per node)

# Execute — with pre-built target
scripts/ssm-exec.sh --target "sagemaker-cluster:CLUSTERID_GROUP-INSTANCEID" 'command' [--region REGION]

# Execute — with parts
scripts/ssm-exec.sh --cluster-id ID --group GROUP --instance-id INSTANCE_ID 'command' [--region REGION]

# Upload
scripts/ssm-exec.sh --target TARGET --upload LOCAL_PATH REMOTE_PATH [--region REGION]

# Read remote file
scripts/ssm-exec.sh --target TARGET --read REMOTE_PATH [--region REGION]

Running Commands Across Many Nodes

SSM start-session rate limit: 3 TPS per account. Plan batch size and delay accordingly.

aws ssm send-command does NOT support sagemaker-cluster: targets — only start-session works.
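
A simple way to stay under the limit is to run nodes sequentially and pause after every few sessions. The loop below is a sketch under that assumption; run_throttled, its argument order, and the demo runner are illustrative names, and the batch size and delay are values to tune against what else is using SSM in the account.

```shell
# run_throttled BATCH_SIZE DELAY_SECONDS RUNNER TARGET...
# Calls "RUNNER TARGET" once per target, sleeping DELAY_SECONDS after
# every BATCH_SIZE calls (but not after the final target).
run_throttled() {
  local batch="$1" delay="$2" runner="$3"
  shift 3
  local count=0 target
  for target in "$@"; do
    "$runner" "$target"
    count=$((count + 1))
    if [ $((count % batch)) -eq 0 ] && [ "$count" -lt "$#" ]; then
      sleep "$delay"
    fi
  done
}

# Dry run with a stand-in runner, so nothing touches AWS:
demo_runner() { echo "would run on $1"; }
run_throttled 3 0 demo_runner node-a node-b node-c node-d

# Real use (commented out because it needs AWS access):
# run_on_node() { scripts/ssm-exec.sh --target "$1" 'nvidia-smi'; }
# run_throttled 3 1 run_on_node "$TARGET_1" "$TARGET_2" "$TARGET_3" "$TARGET_4"
```

With a batch size of 3 and a delay of 1 second, at most three sessions start between pauses.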

Manual SSM Commands

When the scripts aren't suitable, use aws ssm start-session directly with AWS-StartNonInteractiveCommand. Wrap every invocation in unbuffer — without it, stdout is intermittently empty (see Prerequisites).

cat > /tmp/cmd.json << 'EOF'
{"command": ["bash -c 'echo hello && whoami'"]}
EOF

unbuffer aws ssm start-session \
  --target sagemaker-cluster:{CLUSTER_ID}_{GROUP_NAME}-{INSTANCE_ID} \
  --region REGION \
  --document-name AWS-StartNonInteractiveCommand \
  --parameters file:///tmp/cmd.json

Always use a JSON file for --parameters — inline parameters break with special characters.
The document's command parameter is argv, not shell input. Wrap multi-statement scripts in bash -c '...' so pipes, semicolons, and redirects evaluate.
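
Both notes can be combined in a small helper that lets jq do the JSON escaping and applies the bash -c wrapper. This is a sketch, not the logic in ssm-exec.sh: the name write_params_file is hypothetical, and refusing commands that contain a single quote is a simplification chosen here to avoid guessing how nested quotes are parsed.

```shell
# write_params_file COMMAND OUTFILE   (hypothetical helper)
# Writes {"command": ["bash -c 'COMMAND'"]} to OUTFILE using jq, so that
# double quotes, backslashes, and similar characters are escaped as JSON.
write_params_file() {
  local cmd="$1" outfile="$2"
  case $cmd in
    *"'"*)
      # Simplification: single quotes would end the bash -c '...' wrapper.
      echo "error: command contains a single quote" >&2
      return 1
      ;;
  esac
  jq -cn --arg c "bash -c '$cmd'" '{command: [$c]}' > "$outfile"
}

# Usage (the start-session call is commented out because it needs AWS access):
# write_params_file 'echo hello && whoami' /tmp/cmd.json
# unbuffer aws ssm start-session --target "$TARGET" --region "$REGION" \
#   --document-name AWS-StartNonInteractiveCommand \
#   --parameters file:///tmp/cmd.json
```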

Common Diagnostic Commands

Task	Command
Lifecycle logs	`cat /var/log/provision/provisioning.log`
Memory	`free -h`
Disk/mounts	`df -h && lsblk`
GPU status	`nvidia-smi`
GPU memory	`nvidia-smi --query-gpu=memory.used,memory.total --format=csv`
EFA/network	`fi_info -p efa`
CloudWatch agent	`sudo systemctl status amazon-cloudwatch-agent`
Top processes	`ps aux --sort=-%mem | head -20`
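
Several of the checks in the table can be collected from one node in a single pass. The sketch below takes the runner as an argument so it can be tried without AWS access; collect_diagnostics and ssm_runner are illustrative names, and it assumes the runner evaluates compound commands such as df -h && lsblk in a shell. Each check opens its own session, so the 3 TPS limit still applies.

```shell
# collect_diagnostics RUNNER TARGET
# RUNNER is any command that accepts "TARGET COMMAND" (hypothetical wrapper).
collect_diagnostics() {
  local runner="$1" target="$2" cmd
  for cmd in 'free -h' 'df -h && lsblk' 'nvidia-smi' 'fi_info -p efa'; do
    printf '=== %s ===\n' "$cmd"
    "$runner" "$target" "$cmd" || echo "(command failed: $cmd)"
  done
}

# Real use (commented out because it needs AWS access):
# ssm_runner() { scripts/ssm-exec.sh --target "$1" "$2"; }
# collect_diagnostics ssm_runner "sagemaker-cluster:CLUSTERID_GROUP-INSTANCEID"
```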

Key Details

Default SSM non-interactive user is root.
SSM rate limit: 3 TPS per account.
For interactive sessions (rare), omit --document-name to get a shell.
Interactive commands (vim, top) are not supported via AWS-StartNonInteractiveCommand.
Large outputs may be truncated by SSM.
For troubleshooting common errors, see references/troubleshooting.md.
