From vanguard-frontier-agentic
Reviews NVIDIA AI networking fabrics (Spectrum-X Ethernet or InfiniBand) for rail-optimized topology, NCCL collective tuning, RoCEv2 lossless DCQCN/PFC config, adaptive routing, and east-west isolation against NCP-AIN standards.
How this skill is triggered — by the user, by Claude, or both
Slash command
/vanguard-frontier-agentic:nvidia-ai-networking-fabric-review
This skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
Review NVIDIA AI fabric configuration against the NCP-AIN body of knowledge: rail-optimized topology, NCCL collective communication tuning (NCCL_TOPO, NCCL_IB_HCA, NCCL_NET_GDR_LEVEL), RoCEv2 lossless DCQCN/PFC, InfiniBand subnet manager and partitioning, adaptive routing, and tenant/job east-west isolation.
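As a rough illustration of the NCCL tuning inputs named in the summary, the sketch below reports which of those settings are present in a job's environment before a review starts. It is a minimal sketch, not part of the skill itself: the function name `nccl_settings` is hypothetical, and treating `NCCL_TOPO` as a prefix is an assumption, since the summary may be abbreviating a longer variable name.

```python
import os
from typing import Mapping

# Names copied from the skill summary. Treating NCCL_TOPO as a prefix is an
# assumption: the summary may be abbreviating a longer variable name.
EXACT_NAMES = ("NCCL_IB_HCA", "NCCL_NET_GDR_LEVEL")
PREFIXES = ("NCCL_TOPO",)


def nccl_settings(env: Mapping[str, str]) -> dict:
    """Return the NCCL settings found in env, plus the exact names that are unset."""
    found = {
        key: value
        for key, value in env.items()
        if key in EXACT_NAMES or key.startswith(PREFIXES)
    }
    missing = [name for name in EXACT_NAMES if name not in env]
    return {"found": found, "missing": missing}


if __name__ == "__main__":
    report = nccl_settings(os.environ)
    for name, value in sorted(report["found"].items()):
        print(f"{name}={value}")
    for name in report["missing"]:
        print(f"{name} is unset (NCCL will use its default)")
```

Passing the environment in as a mapping, rather than reading `os.environ` inside the function, keeps the check usable against a captured job spec as well as a live shell.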
ibstat, ibdiagnet, nccl-tests all_reduce_perf baselines, ethtool -S, switch QoS counters) when the active client exposes it; otherwise fall back to NVIDIA Spectrum-X / Quantum InfiniBand documentation and sanitized topology diagrams.
nccl-tests baselines stored alongside the cluster spec as a medium finding; regressions cannot be detected.
Return, at minimum:
npx claudepluginhub raishin/vanguard-frontier-agentic --plugin vanguard-frontier-agentic
Reviews NVIDIA GPU infrastructure deployments (DGX, HGX, MGX) against reference architectures, checking BMC segmentation, firmware, driver versions, ECC, persistence mode, and MIG configuration.
Validates that a Dynamo deployment's NIXL/UCX/NCCL interconnect is ready for disaggregated serving over RDMA/NVLink. Use after recipe-runner to confirm KV transport is correct, or troubleshoot for diagnosing already-failed pods.
Diagnose uneven NCCL bandwidth across nodes and poor filesystem throughput on Amazon SageMaker HyperPod clusters. Surfaces host-side signals (Xid, ECC, NVLink, EFA reachability, FSx saturation) and routes to sibling skills for remediation.