From vanguard-frontier-agentic
Statically reviews CUDA C/C++ kernel sources against NVIDIA's performance guidance: memory coalescing, bank conflicts, warp divergence, occupancy, register pressure, and stream concurrency. Emits exact Nsight Compute/Systems commands for runtime confirmation.
How this skill is triggered — by the user, by Claude, or both
Slash command
/vanguard-frontier-agentic:nvidia-cuda-kernel-performance-review
This skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
Static review of CUDA C/C++ kernels for memory coalescing, shared-memory bank conflicts, occupancy, register pressure, and stream concurrency against NVIDIA's official CUDA Programming and Best Practices Guides. This skill is doc-anchored: it grounds review findings in NVIDIA's published documentation rather than in a certification blueprint, because no NVIDIA certification currently covers this developer-facing surface as a standalone exam objective.
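To make these review targets concrete, here is a minimal CUDA sketch of code that would likely pass three of the checks named on this page: coalesced global-memory access, `__restrict__` on non-aliasing pointers, and no `cudaDeviceSynchronize` inside the per-batch loop. The kernel `saxpy`, the helper `run_demo`, the `CHECK` macro, and all sizes are hypothetical names chosen for illustration; they are not part of the skill itself.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: bail out of run_demo with -1 on any CUDA error.
// (A sketch only: resources are not released on the error path.)
#define CHECK(call) do { if ((call) != cudaSuccess) return -1; } while (0)

// Hypothetical kernel computing y[i] = a * x[i] + y[i].
// Consecutive threads touch consecutive elements, so each warp's loads and
// stores fall in contiguous memory (the coalesced pattern).
// __restrict__ promises the compiler that x and y do not alias.
__global__ void saxpy(int n, float a,
                      const float* __restrict__ x,
                      float* __restrict__ y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i] + y[i];
    }
}

// Runs `batches` independent batches, one stream each, and returns the
// number of wrong results, or -1 if a CUDA call failed.
int run_demo(int batches = 4, int n = 1 << 16)
{
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    std::vector<float> hx(n, 1.0f), hy(n, 2.0f);
    std::vector<float*> dx(batches, nullptr), dy(batches, nullptr);
    std::vector<cudaStream_t> streams(batches);

    // Setup: allocate and upload inputs, create one stream per batch.
    for (int b = 0; b < batches; ++b) {
        CHECK(cudaMalloc(&dx[b], bytes));
        CHECK(cudaMalloc(&dy[b], bytes));
        CHECK(cudaMemcpy(dx[b], hx.data(), bytes, cudaMemcpyHostToDevice));
        CHECK(cudaMemcpy(dy[b], hy.data(), bytes, cudaMemcpyHostToDevice));
        CHECK(cudaStreamCreate(&streams[b]));
    }

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    for (int b = 0; b < batches; ++b) {
        saxpy<<<blocks, threads, 0, streams[b]>>>(n, 3.0f, dx[b], dy[b]);
        // Deliberately no cudaDeviceSynchronize() here: a sync inside this
        // loop would force each batch to finish before the next is launched.
    }
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());  // single sync after all launches

    // Verify: every element should be 3 * 1 + 2 = 5.
    int wrong = 0;
    for (int b = 0; b < batches; ++b) {
        CHECK(cudaMemcpy(hy.data(), dy[b], bytes, cudaMemcpyDeviceToHost));
        for (int i = 0; i < n; ++i) {
            if (hy[i] != 5.0f) ++wrong;
        }
        cudaFree(dx[b]);
        cudaFree(dy[b]);
        cudaStreamDestroy(streams[b]);
    }
    return wrong;
}
```

Calling `run_demo()` from a `main` function and checking for a return value of 0 is one way to exercise the sketch. Whether the batches actually overlap on a given GPU is a runtime question that static review cannot settle, which is why the skill defers to Nsight for confirmation.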
Use .cu and .cuh sources as evidence; otherwise fall back to documentation-based inference and say so.
Flag cudaDeviceSynchronize inside hot paths or per-batch loops as a medium finding, because it blocks the host until all outstanding work finishes and destroys stream concurrency.
Flag missing __restrict__ qualifiers on non-aliasing pointer arguments as a low finding, because without them the compiler must assume aliasing and cannot keep loaded values in registers.
Emit the nsight-compute and nsight-systems commands the user should run for runtime confirmation; do not execute them.
Return, at minimum:
npx claudepluginhub raishin/vanguard-frontier-agentic --plugin vanguard-frontier-agentic
Identifies whether a GPU kernel is compute-bound, memory-bound, or latency-bound using roofline analysis, occupancy calculations, and SASS instruction inspection. Guides optimization strategy selection.
Profiles CUDA/CUTLASS/CuTe DSL/Triton GPU kernels: checks environment, validates correctness, collects Nsight Compute metrics, and classifies bottlenecks (memory/compute/latency/occupancy/mixed bound).
Profiles GPU kernels with Nsight Compute, exports metrics/source/PM-sampling reports, compares baseline vs candidate, classifies stalls, and produces one actionable kernel edit.