From NVIDIA
Validates that a Dynamo deployment's NIXL/UCX/NCCL interconnect is ready for disaggregated serving over RDMA/NVLink. Use it after dynamo-recipe-runner to confirm that KV transport is correct; to diagnose pods that have already failed, use dynamo-troubleshoot.
Slash command: /nvidia-skills:dynamo-interconnect-check
Confirm that the transport disaggregated serving depends on actually works. A deployment can pass an endpoint smoke test while disagg is silently wrong: if NIXL/UCX cannot reach the peer worker over RDMA or NVLink, KV transfer falls back to a slow or broken path. Catch that with read-only checks before trusting a disagg deployment or its benchmark numbers.
This skill is read-only. It never mutates the cluster and never prints secrets.
- kubectl exec access to a worker pod in the target Dynamo deployment.
- A recipe path (recipes/<model>/<framework>/<mode>).
- ibstat, nvidia-smi, lsmod available in the worker pod image (missing tools are reported as skipped, not failures).

Use this skill after dynamo-recipe-runner deploys a disagg or multi-node recipe. For diagnosing pods that are already crashing or unschedulable, use
dynamo-troubleshoot first.
python3 scripts/check_interconnect.py env recipes/<model>/<framework>/<mode>
Reports which NIXL/UCX/NCCL transport variables are set and flags
disagg-critical ones (e.g. UCX_TLS, UCX_NET_DEVICES, NCCL_IB_HCA) that are
absent. Missing here is only a warning — they may be baked into the image — so
confirm with the node and NIXL checks. See
references/interconnect-env-vars.md for what each variable does.
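The env check described above can be pictured with a minimal sketch. This is not the logic of scripts/check_interconnect.py; the function name `check_transport_env` is hypothetical, and the critical set is assumed to be only the three variables the text names as examples.

```python
# Hypothetical helper for illustration; not taken from scripts/check_interconnect.py.
# Assumption: the disagg-critical set is just the three variables named in the text.
CRITICAL_VARS = ("UCX_TLS", "UCX_NET_DEVICES", "NCCL_IB_HCA")
TRANSPORT_PREFIXES = ("NIXL_", "UCX_", "NCCL_")


def check_transport_env(env):
    """Return (status, detail, transport_vars) for a mapping of env var names to values."""
    # Keep only variables that belong to one of the transport families.
    transport = {k: v for k, v in env.items() if k.startswith(TRANSPORT_PREFIXES)}
    missing = [name for name in CRITICAL_VARS if name not in transport]
    if missing:
        # Only a warning: the variables may be baked into the image.
        return "warn", "missing: " + ", ".join(missing), transport
    return "ok", "all critical transport variables set", transport


status, detail, found = check_transport_env({"UCX_TLS": "rc,cuda_copy", "PATH": "/usr/bin"})
print(status, detail)
```

A `warn` here is a prompt to run the node and NIXL checks, not evidence that the deployment is broken.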
Locally on a GPU node, or inside a running worker pod:
python3 scripts/check_interconnect.py node \
--namespace "${NAMESPACE}" --pod <worker-pod>
Probes (read-only) for: InfiniBand devices and Active links, GPUDirect RDMA
(nvidia_peermem), GDRCopy, and NVLink in the GPU topology. Missing tools are
reported as skipped, not failures.
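As a rough illustration of what such read-only probes might look like, the sketch below parses `ibstat` and `lsmod` text and reports a missing tool as skipped. The helper names (`count_active_ports`, `module_loaded`, `run_tool`) and the sample outputs are assumptions made for this example, not the script's actual code.

```python
import shutil
import subprocess


def count_active_ports(ibstat_text):
    """Count ports whose logical state line reads 'State: Active'."""
    return sum(1 for line in ibstat_text.splitlines() if line.strip() == "State: Active")


def module_loaded(lsmod_text, module):
    """True if `module` appears in the first column of lsmod output."""
    rows = lsmod_text.splitlines()[1:]  # skip the "Module  Size  Used by" header
    return any(row.split()[0] == module for row in rows if row.split())


def run_tool(argv):
    """Run a read-only tool and return (status, output); a missing tool is 'skipped'."""
    if shutil.which(argv[0]) is None:
        return "skipped", f"{argv[0]} not found"
    proc = subprocess.run(argv, capture_output=True, text=True, check=False)
    return ("ok" if proc.returncode == 0 else "fail"), proc.stdout


# Illustrative sample outputs (assumed shapes, trimmed for brevity).
SAMPLE_IBSTAT = """CA 'mlx5_0'
    Port 1:
        State: Active
        Physical state: LinkUp
CA 'mlx5_1'
    Port 1:
        State: Down
        Physical state: Disabled
"""
SAMPLE_LSMOD = (
    "Module                  Size  Used by\n"
    "nvidia_peermem         16384  0\n"
    "nvidia               1000000  1 nvidia_peermem\n"
)

print(count_active_ports(SAMPLE_IBSTAT), module_loaded(SAMPLE_LSMOD, "nvidia_peermem"))
```

Keeping the parsers separate from `run_tool` lets them be exercised against captured output without access to a GPU node.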
python3 scripts/check_interconnect.py nixl \
--namespace "${NAMESPACE}" --pod <worker-pod>
Looks for NIXL test tooling in the pod and surfaces the exact next step to run a pairwise prefill↔decode transfer test. A full cross-pod transfer test requires two scheduled GPU pods on the fabric.
| Script | Purpose | Arguments |
|---|---|---|
| scripts/check_interconnect.py env | Inspect NIXL/UCX/NCCL env vars on a recipe | positional recipe path |
| scripts/check_interconnect.py node | Probe InfiniBand, GPUDirect RDMA, GDRCopy, NVLink on a node or pod | --namespace, --pod |
| scripts/check_interconnect.py nixl | Surface NIXL transfer-test readiness for a pod | --namespace, --pod |
Invoke via the agentskills.io run_script() protocol:
run_script("scripts/check_interconnect.py", args=["env", "recipes/qwen3-coder-480b/sglang/disagg"])
run_script("scripts/check_interconnect.py", args=["node", "--namespace", "dynamo-demo", "--pod", "qwen-worker-0"])
Verify a disagg recipe's transport env shape before deploy:
python3 scripts/check_interconnect.py env recipes/qwen3-coder-480b/sglang/disagg
After deploy, validate a worker pod's fabric:
python3 scripts/check_interconnect.py node \
--namespace dynamo-demo --pod qwen-worker-0
python3 scripts/check_interconnect.py nixl \
--namespace dynamo-demo --pod qwen-worker-0
Equivalent through the agent protocol:
run_script("scripts/check_interconnect.py", args=["nixl", "--namespace", "dynamo-demo", "--pod", "qwen-worker-0"])
Each check returns ok / warn / fail / skipped with a one-line detail,
plus a rolled-up verdict on disagg transport readiness. When reporting, note that
skipped results for missing tools (ibstat, nvidia-smi, lsmod) are inconclusive, not a pass.

| Symptom | Likely cause | Next step |
|---|---|---|
| env reports all critical vars missing | Vars baked into image or injected by operator | Run the node check inside the worker pod to verify actual env |
| node reports no Active IB link | Fabric down or HCA not provisioned to the node | Contact cluster admin; verify kubectl describe node shows nvidia.com/gpu and IB labels |
| nvidia_peermem missing | GPUDirect RDMA module not loaded | Ask cluster admin to load nvidia-peermem; without it, NIXL falls back to staged copies |
| nixl finds no test tools | Worker image lacks NIXL test harness | Use a NIXL-enabled image or run the standalone transfer test from a debug pod |
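One way to read the rolled-up verdict is sketched below. The precedence (fail, then warn, then skipped) and the `inconclusive` label are assumptions for illustration; the source only states that skipped results must not count as a pass. The function name `roll_up` and the check names are hypothetical.

```python
def roll_up(results):
    """Combine per-check statuses into one verdict.

    Assumed precedence (not confirmed by the source): fail > warn > skipped > ok,
    with skipped mapped to 'inconclusive' so it never counts as a pass.
    """
    if not results:
        return "inconclusive"  # nothing was checked, so nothing can be claimed
    statuses = set(results.values())
    if "fail" in statuses:
        return "fail"
    if "warn" in statuses:
        return "warn"
    if "skipped" in statuses:
        return "inconclusive"
    return "ok"


verdict = roll_up({"ib_link": "ok", "nvidia_peermem": "skipped", "nvlink": "ok"})
print(verdict)
```

Under these assumptions, a pod image that lacks lsmod yields an inconclusive verdict even when every other probe passes.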
See BENCHMARK.md for the NVCARPS-EVAL performance report (auto-generated by the NVSkills CI pipeline). To refresh, re-run /nvskills-ci on an upstream PR touching this skill.
- references/interconnect-env-vars.md: NIXL/UCX/NCCL env var catalog and IB capability checklist.
- scripts/check_interconnect.py for all read-only checks.