From zymtrace
Use when investigating a GPU or CPU workload through the zymtrace MCP. The MCP does most of the analysis; this skill enforces the cross-view — always pull the matching opposite-side flamegraph (CPU for GPU workloads, GPU for CPU workloads) with the same filter. Most bottlenecks hide on the side the customer didn't ask about. Trigger phrases: "analyze my GPU workload", "where's the bottleneck in vllm", "investigate my training job", "find the hot kernel", "GPU isn't saturated", "investigate using flamegraph", "use zymtrace mcp to analyze".
How this skill is triggered — by the user, by Claude, or both
Slash command
/zymtrace:analyze-zymtrace-workloadThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
> The zymtrace MCP does most of the work: identifies the workload, fetches flamegraphs, names the hot stacks, surfaces patterns, recommends fixes. This skill's only job is to make sure you **always pull both the GPU and the CPU view with the same filter** — half the time the bottleneck is on the side the customer didn't ask about.
The zymtrace MCP does most of the work: identifies the workload, fetches flamegraphs, names the hot stacks, surfaces patterns, recommends fixes. This skill's only job is to make sure you always pull both the GPU and the CPU view with the same filter — half the time the bottleneck is on the side the customer didn't ask about.
Always recommend a fix. Every 🔴 issue in the recap gets a concrete
**Fix:**block — whether or not the customer asked for solutions. Don't hedge with "let me know if you want suggestions" or "ask about constraints before recommending". Lead with the most plausible specific fix from the data; the customer can push back if their constraints don't fit. Profile analysis without recommendations is incomplete output.
Connection setup lives in configure-zymtrace-mcp. This skill assumes the MCP is already connected.
Optional pairing — GitHub MCP: if the user also has the GitHub MCP connected (claude mcp list shows both) and asks for code-level pointers, Claude can locate the hot frame in their repo and reference a specific <file>:<line> in the fix. This is a suggestion, not a default — many users don't want or need code access from the session. Mention the option once if both MCPs are available; respect the answer either way.
"Analyze the GPU flamegraph over the last 1 hour and suggest solutions."
If the customer hands you anything close to that (or shorter — "what's slow", "investigate my GPU"), interpret it as: scope the analysis to the last 1 hour, pull the GPU flamegraph, cross-check the CPU view, follow the output template below. Don't make them remember the specifics — this is the on-ramp.
Variations the customer might use:
For any of these: default to the last 1 hour if no time range is given, default to the whole cluster if no workload is named (and ask which to narrow if results look noisy).
claude mcp list | grep -i zymtrace
If zymtrace isn't listed → route to configure-zymtrace-mcp. If listed, proceed.
The MCP handles the analysis; you handle the discipline of asking for both sides.
Ask the MCP to investigate the workload the customer named (executable / container / pod / time range / model — whatever signals they gave). The MCP will pick up the right scope.
Pull whichever view the customer's question implies first — GPU view for a GPU-shaped question, CPU view for a CPU-shaped one. Let the MCP narrate what's hot.
Then explicitly ask the MCP for the OPPOSITE view of the same workload, with the same filter. Use the exact filter values the MCP locked onto in step 2 — same executable, same container, same time range. Don't hand-wave the filter; the cross-view is only useful when the slice matches.
Cross-reference the two views. Common reveals:
cudaMemcpy* / aten::* synchronization → the workload is sync-bound on device transfers; the GPU view will show idle stretches.Write the recap using the output template below. Use the data the MCP returned — kernel names, percentages, hot stacks, the call tree from the CPU view, and the kernels triggered on the GPU side — to fill the template. Don't paraphrase the MCP's suggestions verbatim; synthesize across the two views into a concrete next step. If the MCP didn't surface a suggestion, you still produce one — grounded in the returned data, not invented.
If the GitHub MCP is also connected, take the recommendation one step further: locate the hot frame in the customer's repo (file + line) and propose the specific edit. The recap's Fix: block then becomes an actual <file>:<line> reference with a code snippet, not a generic instruction.
Every recap follows this shape. Don't deviate — the structure is the value.
# <Workload type> Flamegraph Analysis
**Observed Call Tree — GPU profile** (<process path / container / time range>)
<top-level frame>
├── <child frame>
│ ├── <leaf frame> → <CUDA kernel that was running at this sample>
│ ├── <leaf frame> → <CUDA kernel ...>
│ └── ...
├── <sync-point frame> → cudaStreamSynchronize + D→H memcpy ⚠️
└── <sync-point frame> → cudaDeviceSynchronize ⚠️
**CPU cross-check** (<same process / container / time range>)
<1–2 sentences naming what the CPU profile adds — DataLoader stalls, Python overhead,
tokenizer hot spots, host-side launch overhead, or "nothing else surfaced; the
constraint is on the GPU side". Keep short.>
**Key Findings**
<1–2 paragraphs naming what the workload IS and the dominant pattern.
Examples: "kernel-launch-bound / dispatcher-overhead", "memory-bandwidth bound",
"DataLoader-starved", "NCCL-collective-bound".>
---
## 🔴 Top issues (max 3, in priority order)
### 1. <Title>
<Observation paragraph — kernel names + percentages from the actual flamegraph. Plain prose, no label.>
**Fix:** <Concrete action — always present, never gated on whether the customer asked for solutions. For inference: name the specific flag/env var with a 1–3 line snippet when the fix is one line. For training: name the most plausible concrete fix from the data (e.g. "wrap with `torch.compile(mode='reduce-overhead')`", "remove `.item()` from the hot loop", "switch to `channels_last`"), not just a family. The customer can push back if constraints don't fit.>
### 2. <Title>
<Observation paragraph.>
**Fix:** <Concrete action.>
### 3. <Title>
<Observation paragraph.>
**Fix:** <Concrete action.>
---
## 🟡 To consider after the above (max 2)
- <One-line observation> — **Fix:** <one-line action>
- <One-line observation> — **Fix:** <one-line action>
---
**Expected Impact**
<Qualitative description of what the fixes should achieve. Numbers only if the
MCP returned them or they're well-known order-of-magnitude estimates.>
Severity & sizing:
Call tree conventions:
├── and └── for the hierarchy (matches what the MCP returns).→ to annotate each leaf with the CUDA kernel that was running when that frame was sampled. This is not a "CPU→GPU" link; it's the kernel underneath that frame.cudaStreamSynchronize, cudaDeviceSynchronize, D→H memcpy) with ⚠️ — they almost always deserve calling out since they kill pipelining.CPU cross-check conventions:
Issue body conventions:
### N. <Title> sub-heading, then a plain prose paragraph (the observation — no Observation: label needed; the paragraph IS the observation), then a blank line, then **Fix:** on its own line in bold with the concrete action.**Fix:** block is the concrete action.
--enable-prefix-caching, VLLM_ATTENTION_BACKEND=..., use_fast=True). Almost always a config knob; cheap to try.torch.compile", "set memory_format=torch.channels_last", "remove .item() from the hot loop", "bump num_workers to 4×GPUs, set pin_memory=True". Don't punt to "name the family and ask". Lead with the recommendation; the customer pushes back if their constraints don't fit.<observation> — **Fix:** <action>. The em-dash + bold Fix label keeps the visual signal even on one line.→ GPU annotations + ⚠️ sync markers), CPU cross-check, Key Findings, 🔴 top issues block (max 3, each with Observation: + Fix:), 🟡 follow-up block (max 2 one-liners), Expected Impact.**Fix:** block — grounded in the actual flamegraph data, never punted ("ask me if you want suggestions") and never invented. Same for the 🟡 follow-ups: each has a **Fix:** after the em-dash.privileged: true) without flagging the security implication. See install-zymtrace-profiler § PC sampling.Creates, edits, and optimizes skills for Claude Code, including drafting, evaluating with test prompts, iterating on performance, and improving skill descriptions for better triggering accuracy.
npx claudepluginhub zystem-io/zymtrace-skills --plugin zymtrace