From pytorch-skills
Analyze PyTorch profiler traces (.json/.json.gz). Use when the user wants to diagnose GPU performance issues, find slow kernels, identify idle time, analyze communication overhead, or debug training bottlenecks.
Slash command: `/pytorch-skills:profile-model`
Analyze PyTorch profiler traces (`.json`/`.json.gz` files) to diagnose GPU performance issues, find slow kernels, and identify bottlenecks.
Ask the user for the trace file path. Traces are typically:
- `/path/to/trace.json` or `/path/to/trace.json.gz`
- `/path/to/traces/` containing `rank0.json`, `rank1.json`, etc.

Load the trace into a pandas DataFrame for flexible querying.
⚠️ IMPORTANT: Ask user before caching. Do NOT cache automatically. Ask:
"Would you like me to cache this trace as Parquet for faster future loads? This saves to
`/tmp/trace_cache/` and makes subsequent loads faster. (Recommended for large traces or if you plan to ask follow-up questions)"
Only proceed with caching if the user agrees.
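If the user agrees, the caching step could look roughly like the sketch below. The function names (`normalize_mixed_columns`, `cache_trace`) are hypothetical, and the sketch assumes a Parquet engine such as `pyarrow` is installed. Instead of converting with `astype(str)` and then mapping 'nan'/'None' back, it stringifies value by value so missing entries never become strings in the first place.

```python
from pathlib import Path

import pandas as pd

CACHE_DIR = Path("/tmp/trace_cache")  # cache location named in this document


def _to_str_or_none(value):
    """Stringify a value, keeping None/NaN as a real missing value."""
    if value is None or (isinstance(value, float) and value != value):
        return None
    return str(value)


def normalize_mixed_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which mixed-type args_* columns are stored as strings."""
    out = df.copy()
    for col in out.columns:
        if not str(col).startswith("args_"):
            continue
        present = out[col].dropna()
        # More than one Python type in a column is what breaks Parquet writes.
        if present.map(type).nunique() > 1:
            out[col] = pd.Series(
                [_to_str_or_none(v) for v in out[col]],
                index=out.index,
                dtype=object,
            )
    return out


def cache_trace(df: pd.DataFrame, trace_path: str) -> Path:
    """Write df to /tmp/trace_cache/{basename}.parquet.

    Call this only after the user has agreed to caching.
    """
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    cache_path = CACHE_DIR / f"{Path(trace_path).name}.parquet"
    # Assumption: pyarrow or fastparquet is available for to_parquet.
    normalize_mixed_columns(df).to_parquet(cache_path, index=False)
    return cache_path
```

Columns that hold nested lists (for example input dimensions) may still need separate handling; this sketch only addresses the mixed scalar types described in the caching notes.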
DataFrame structure:
- `name`, `cat`, `ph`, `ts`, `dur`, `pid`, `tid`, `id`
- `args_device`, `args_stream`, `args_Input Dims`, etc.

Loading approach:
- Expand the `args` dict into `args_*` columns
- Convert `ts` and `dur` to numeric

Parquet caching notes (if user agrees):
- Cache path: `/tmp/trace_cache/{basename}.parquet`
- `args_*` columns often have mixed types (int, str, None). Before saving to Parquet, detect mixed-type columns and convert them to strings. Replace 'nan'/'None' strings back to None.
- Expect converted `args_*` columns as strings when reading the cache back

After loading, show a quick summary:
- Event counts per category (`df["cat"].value_counts()`)

Key column values for filtering:
- GPU kernels: `cat == "kernel"`
- CPU operators: `cat == "cpu_op"`
- NCCL communication: `name.str.contains("nccl", case=False)`
- Memory operations: `cat` contains "mem" or "memcpy"

Timestamps (`ts`) and durations (`dur`) are in microseconds (μs).
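The loading approach above can be sketched as follows. This is a minimal version, not a definitive loader: `load_trace` is a hypothetical helper name, and it assumes the file follows the Chrome trace layout that the PyTorch profiler exports, with events stored under a top-level `traceEvents` key.

```python
import gzip
import json

import pandas as pd


def load_trace(path: str) -> pd.DataFrame:
    """Load a profiler trace (.json or .json.gz) into a flat DataFrame."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        data = json.load(f)

    # Assumption: events live under "traceEvents"; a bare list is also accepted.
    events = data["traceEvents"] if isinstance(data, dict) else data

    rows = []
    for event in events:
        row = {k: v for k, v in event.items() if k != "args"}
        args = event.get("args")
        if isinstance(args, dict):
            # Expand the args dict into args_* columns.
            row.update({f"args_{k}": v for k, v in args.items()})
        rows.append(row)

    df = pd.DataFrame(rows)
    for col in ("ts", "dur"):
        if col in df.columns:
            # Timestamps and durations are microseconds; force numeric dtype.
            df[col] = pd.to_numeric(df[col], errors="coerce")
    return df
```

After `df = load_trace("/path/to/trace.json.gz")`, `df["cat"].value_counts()` gives the per-category event counts for the quick summary.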
Use standard pandas operations (groupby, filter, sort, aggregate) to answer user questions.
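As one example of such a query, the sketch below ranks GPU kernels by total time. `top_kernels` is a hypothetical helper name; the percentage is each kernel's share of summed kernel time, which is not the same as a share of wall-clock time when kernels overlap or the GPU sits idle.

```python
import pandas as pd


def top_kernels(df: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Summarize GPU kernels by total duration, largest first."""
    kernels = df[df["cat"] == "kernel"]
    summary = (
        kernels.groupby("name")["dur"]
        .agg(calls="count", total_us="sum", avg_us="mean")
        .sort_values("total_us", ascending=False)
    )
    # Share of summed kernel time, computed before truncating to the top n.
    summary["pct_time"] = 100 * summary["total_us"] / summary["total_us"].sum()
    return summary.head(n)
```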
| Finding | Likely Issue | Recommendation |
|---|---|---|
| Many small kernels (avg < 10μs) | Kernel launch overhead | Use torch.compile, fuse operations |
| One kernel dominates (>50%) | Expected for compute-bound | Check if kernel is optimized |
| Many memory operations | Memory-bound | Reduce copies, use pinned memory |
| High communication time | Distributed overhead | Overlap comm with compute |
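The metrics behind this table can be computed directly from the DataFrame. The sketch below is one way to do it, using the table's own thresholds (average under 10 μs, one kernel above 50%); `quick_findings` and the dictionary keys are hypothetical names, and what counts as "many" memory operations or "high" communication time is left to judgment.

```python
import pandas as pd


def quick_findings(df: pd.DataFrame) -> dict:
    """Compute the numbers that the findings table is read against."""
    kernels = df[df["cat"] == "kernel"]
    total = kernels["dur"].sum()
    per_kernel = kernels.groupby("name")["dur"].sum()
    is_comm = kernels["name"].str.contains("nccl", case=False, na=False)
    # "memcpy" contains "mem", so one substring match covers both.
    is_mem = df["cat"].str.contains("mem", case=False, na=False)
    avg_us = kernels["dur"].mean()
    top_share = per_kernel.max() / total
    return {
        "avg_kernel_us": avg_us,
        "many_small_kernels": avg_us < 10,      # kernel launch overhead
        "top_kernel_share": top_share,
        "one_kernel_dominates": top_share > 0.5,
        "memory_op_count": int(is_mem.sum()),
        "comm_share": kernels.loc[is_comm, "dur"].sum() / total,
    }
```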
If the user wants deeper analysis, use HTA (Holistic Trace Analysis):
```python
# Check if HTA is available
try:
    from hta.trace_analysis import TraceAnalysis
except ImportError:
    print("HTA not installed. Install with: pip install HolisticTraceAnalysis")
```
| User Question | HTA Function | What It Returns |
|---|---|---|
| "What's taking GPU time?" | get_gpu_kernel_breakdown() | Kernel time distribution |
| "Where's the critical path?" | critical_path_analysis() | Bottleneck path through execution |
| "Why is GPU idle?" | get_idle_time_breakdown() | Host wait vs kernel wait breakdown |
| "Comm/compute overlap?" | get_comm_comp_overlap() | Overlap percentage per rank |
| "Any slow ranks?" | get_potential_stragglers() | Straggler rank identification |
| "Overall time breakdown?" | get_temporal_breakdown() | Idle/compute/non-compute time |
```python
from hta.trace_analysis import TraceAnalysis

# Load traces
analyzer = TraceAnalysis(trace_dir="/path/to/traces/")
# OR: analyzer = TraceAnalysis(trace_files={0: "/path/to/trace.json.gz"})

# GPU Kernel Breakdown
kernel_type_df, kernel_df = analyzer.get_gpu_kernel_breakdown(visualize=False)

# Temporal Breakdown (idle vs compute vs non-compute)
temporal_df = analyzer.get_temporal_breakdown(visualize=False)

# Idle Time Breakdown (why is GPU idle?)
idle_df, _ = analyzer.get_idle_time_breakdown(ranks=[0], visualize=False)

# Critical Path Analysis (for distributed training)
cp_graph, success = analyzer.critical_path_analysis(
    rank=0, annotation="ProfilerStep", instance_id=1
)
if success:
    analyzer.overlay_critical_path_analysis(
        rank=0, critical_path_graph=cp_graph, output_dir="/tmp/cp_output"
    )

# Communication-Computation Overlap
overlap_df = analyzer.get_comm_comp_overlap(visualize=False)

# Straggler Detection
stragglers = analyzer.get_potential_stragglers(num_candidates=2)
```
| Symptom | Diagnosis | Fix |
|---|---|---|
| Many small kernels | Kernel launch overhead | Use torch.compile, operator fusion |
| High idle time (host wait) | CPU bottleneck | Increase DataLoader num_workers, use pin_memory=True |
| High idle time (kernel wait) | Back-to-back kernel gaps | Fuse operations, reduce synchronization |
| Low comm/compute overlap | Blocking collectives | Use async collectives, overlap with compute |
| Stragglers detected | Load imbalance | Check data distribution, batch sizes across ranks |
| Memory copy overhead | Excessive data movement | Use in-place operations, reduce .cpu()/.cuda() calls |
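The idle-time symptoms in this table are best read from HTA's breakdown. When HTA is not available, a rough cross-check is possible with pandas alone: merge the kernel intervals and measure the gaps. The sketch below is an approximation and not HTA's method. It assumes the DataFrame holds events from a single GPU, it counts only `cat == "kernel"` events as busy time, and `gpu_idle_fraction` is a hypothetical name.

```python
import pandas as pd


def gpu_idle_fraction(df: pd.DataFrame) -> float:
    """Fraction of the kernel time span during which no kernel was running."""
    k = df[df["cat"] == "kernel"].sort_values("ts")
    if k.empty:
        return float("nan")
    starts = k["ts"].to_numpy(dtype=float)
    ends = starts + k["dur"].to_numpy(dtype=float)
    span = ends.max() - starts[0]
    if span <= 0:
        return float("nan")

    busy = 0.0
    cur_start, cur_end = starts[0], ends[0]
    for s, e in zip(starts[1:], ends[1:]):
        if s > cur_end:
            # Gap before this kernel: close the current busy interval.
            busy += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            # Overlap (for example concurrent streams): extend the interval.
            cur_end = max(cur_end, e)
    busy += cur_end - cur_start
    return float(1.0 - busy / span)
```

This does not say why the GPU is idle (host wait versus kernel wait); that distinction still needs `get_idle_time_breakdown()`.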
When reporting analysis results:
Example output:
## Trace Analysis Summary
**Main Finding**: GPU is idle 40% of the time, primarily due to CPU bottleneck (host wait).
### Top GPU Kernels (by time)
| % Time | Calls | Avg (μs) | Kernel |
|--------|-------|----------|--------|
| 35.2% | 1000 | 450.3 | ampere_sgemm_128x64 |
| 22.1% | 1000 | 283.5 | volta_nccl_reduce_scatter |
| 15.8% | 2000 | 101.2 | elementwise_kernel |
### Bottleneck: CPU-Bound
The GPU is waiting for the CPU 40% of the time. This suggests data loading or preprocessing is slow.
### Recommendations
1. Increase DataLoader `num_workers` (currently appears to be 0 or 1)
2. Enable `pin_memory=True` in DataLoader
3. Move preprocessing to GPU if possible
**Want deeper analysis?** I can use HTA to get detailed idle time breakdown and critical path analysis.
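The kernel table in the example output above could be generated from the DataFrame with something like the following sketch. `kernel_table_markdown` is a hypothetical helper, and the percentage is a share of summed kernel time.

```python
import pandas as pd


def kernel_table_markdown(df: pd.DataFrame, n: int = 3) -> str:
    """Render the top-n GPU kernels as a markdown table."""
    kernels = df[df["cat"] == "kernel"]
    total = kernels["dur"].sum()
    stats = (
        kernels.groupby("name")["dur"]
        .agg(["sum", "count", "mean"])
        .sort_values("sum", ascending=False)
        .head(n)
    )
    lines = [
        "| % Time | Calls | Avg (μs) | Kernel |",
        "|--------|-------|----------|--------|",
    ]
    for name, row in stats.iterrows():
        pct = 100 * row["sum"] / total
        lines.append(
            f"| {pct:.1f}% | {int(row['count'])} | {row['mean']:.1f} | {name} |"
        )
    return "\n".join(lines)
```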
npx claudepluginhub meta-pytorch/skills --plugin pytorch-skills