nsys-ai
AI-powered analysis for NVIDIA Nsight Systems profiles
Navigate GPU kernel timelines, diff two runs, and diagnose performance
bottlenecks with an evidence-first agent — from your browser or terminal.
Mission: Build an agent that understands GPU performance from first
principles — one that can identify pipeline bubbles, calculate MFU, assess
arithmetic intensity, and diagnose the root causes that cost millions of GPU
hours, turning months of expert debugging into minutes.

nsys-ai reads .nsys-rep or .sqlite exports from NVIDIA Nsight Systems and turns
them into something you can navigate and reason about: a web timeline, terminal
viewers, a before/after diff that reports whether a change actually helped, and
a set of deterministic analysis skills an LLM agent can drive. .nsys-rep files
are opened directly — nsys-ai exports them to SQLite for you on first use.
Installation
pip install nsys-ai
Analyzing a profile requires only Python 3.10+; neither CUDA nor an Nsight
Systems install is needed. (Capturing a new .nsys-rep, or converting one to
SQLite, needs the nsys CLI on your machine; analyzing an existing .sqlite does
not.)
Quick start
1. Capture a profile
For ML training, capture a few representative iterations rather than the whole
run — it keeps the profile small and the profiler overhead low. Mark the region
with the CUDA profiler API and trace CUDA plus NVTX:
import torch

# `warmup` and `train_step` come from your own training script.
for step in range(warmup):
    train_step()
torch.cuda.synchronize()

torch.cuda.cudart().cudaProfilerStart()
for step in range(3):  # profile these iterations
    train_step()
torch.cuda.synchronize()
torch.cuda.cudart().cudaProfilerStop()
nsys profile --capture-range=cudaProfilerApi --trace=cuda,nvtx \
    -o my_training python train.py
# -> my_training.nsys-rep
--trace=cuda is what every skill relies on (GPU kernels, memory copies, CUDA
API). nvtx adds the annotation hierarchy that drives the iteration, region,
and layer views. To use the iteration tools (iters, diff --iteration),
annotate each step with a consistent NVTX marker; see Focused Profiling and
NVTX Annotations.
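As a minimal sketch of per-step annotation, the helper below wraps each iteration in its own NVTX range using PyTorch's `torch.cuda.nvtx.range` context manager. The helper names (`nvtx_range`, `run_profiled_steps`) and the `iteration_<n>` marker name are illustrative assumptions, not something nsys-ai prescribes here; check the NVTX Annotations guide for the marker format the iteration tools expect. The fallback to a no-op context is only there so the script still runs on a machine without PyTorch or a GPU.

```python
from contextlib import nullcontext


def nvtx_range(name):
    """Return an NVTX range context manager, or a no-op if CUDA/PyTorch is unavailable."""
    try:
        import torch

        if torch.cuda.is_available():
            return torch.cuda.nvtx.range(name)
    except ImportError:
        pass
    return nullcontext()


def run_profiled_steps(train_step, num_steps=3):
    """Run `num_steps` iterations, wrapping each one in its own NVTX range."""
    for step in range(num_steps):
        # Hypothetical naming scheme: same prefix every step, step index as suffix.
        with nvtx_range(f"iteration_{step}"):
            train_step()
```

In the capture loop, `run_profiled_steps(train_step, num_steps=3)` would sit between the `cudaProfilerStart()` and `cudaProfilerStop()` calls.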
No workload handy? Download an example profile:
cd examples/example-20-megatron-distca && python download_data.py
# -> output/megatron_distca.nsys-rep
2. Open it
# Default: open the web timeline in your browser
nsys-ai my_training.nsys-rep
# Metadata and GPU info
nsys-ai info my_training.nsys-rep
# GPU kernel summary
nsys-ai summary my_training.nsys-rep --gpu 0
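# For intuition, a kernel summary of this kind can be approximated by querying
# the SQLite export directly. The sketch below is not nsys-ai's implementation.
# It assumes the usual Nsight Systems export schema (a CUPTI_ACTIVITY_KIND_KERNEL
# table with start/end timestamps in nanoseconds, deviceId, and a shortName that
# references StringIds), which can differ between nsys versions. The function
# name top_kernels is hypothetical.

```python
import sqlite3


def top_kernels(conn, device_id=0, limit=10):
    """Return (name, launches, total_ms) rows for the kernels using the most GPU time."""
    query = """
        SELECT s.value AS name,
               COUNT(*) AS launches,
               SUM(k."end" - k.start) / 1e6 AS total_ms  -- ns -> ms
        FROM CUPTI_ACTIVITY_KIND_KERNEL AS k
        JOIN StringIds AS s ON s.id = k.shortName
        WHERE k.deviceId = ?
        GROUP BY s.value
        ORDER BY total_ms DESC
        LIMIT ?
    """
    return conn.execute(query, (device_id, limit)).fetchall()


def print_summary(path, device_id=0):
    """Open an exported profile and print one line per kernel."""
    with sqlite3.connect(path) as conn:
        for name, launches, total_ms in top_kernels(conn, device_id):
            print(f"{total_ms:10.3f} ms  {launches:6d}x  {name}")
```

# Calling print_summary("my_training.sqlite") assumes the profile has already
# been exported to SQLite under that name.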
Prefer the terminal? The TUIs work the same way:
nsys-ai timeline my_training.nsys-rep --gpu 0 # Perfetto-style horizontal timeline
nsys-ai tui my_training.nsys-rep --gpu 0 # NVTX tree browser
3. Compare two runs
nsys-ai diff before.sqlite after.sqlite
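To make "did the change help?" concrete, here is a small sketch of the kind of comparison a before/after diff performs: per-kernel total GPU time in each run and the signed difference. It illustrates the idea only and is not how `nsys-ai diff` is implemented; the function name `diff_kernel_totals` and the dict-of-milliseconds input format are assumptions made for the example.

```python
def diff_kernel_totals(before, after):
    """Compare per-kernel total GPU time (ms) between two runs.

    Returns (name, before_ms, after_ms, delta_ms) rows, largest change first.
    A negative delta means the kernel got faster in the "after" run.
    """
    rows = []
    for name in set(before) | set(after):
        b = before.get(name, 0.0)  # kernel absent from a run counts as 0 ms
        a = after.get(name, 0.0)
        rows.append((name, b, a, a - b))
    # Sort by magnitude of change, then by name for a stable order.
    rows.sort(key=lambda r: (-abs(r[3]), r[0]))
    return rows


def net_change_ms(rows):
    """Sum of all deltas: negative means less total kernel time after the change."""
    return sum(r[3] for r in rows)
```

Total kernel time is only one signal. Kernels that overlap across streams mean a lower sum does not always translate into a shorter iteration.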
Web timeline
A browser-based multi-GPU viewer with progressive rendering — no --trim
required. This is the default view when you run nsys-ai <profile>.
nsys-ai my_training.nsys-rep # opens in your browser
nsys-ai timeline-web my_training.nsys-rep --gpu 0 1 2 3
- Multi-GPU stacked view with color-coded separators
- Progressive rendering — pre-builds the NVTX tree at startup, then serves tiles
in about a millisecond each
- NVTX hierarchy bars (L0-L5) per GPU
- AI chat sidebar (press a) and kernel search (press /)
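The "index once at startup, then answer each tile cheaply" pattern behind progressive rendering can be sketched as follows. This is a generic illustration, not nsys-ai's tile server: the `StreamIndex` class is hypothetical, and it assumes the events on one stream do not overlap, so both their start and end times are sorted and each tile request reduces to two binary searches.

```python
from bisect import bisect_left, bisect_right


class StreamIndex:
    """Index of non-overlapping (start_ns, end_ns, name) events on one stream."""

    def __init__(self, events):
        # One-time startup cost: sort by start time and cache the boundaries.
        self.events = sorted(events)
        self.starts = [e[0] for e in self.events]
        self.ends = [e[1] for e in self.events]

    def tile(self, t0, t1):
        """Return the events that overlap the half-open window [t0, t1)."""
        lo = bisect_right(self.ends, t0)   # first event ending after t0
        hi = bisect_left(self.starts, t1)  # first event starting at or after t1
        return self.events[lo:hi]
```

Each call to `tile` costs O(log n) plus the size of the result, independent of how long the full profile is.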
| Input | Action |
|---|---|
| Swipe / h l / arrows | Pan through time |
| Swipe up-down / j k | Select stream |
| Pinch / Shift+scroll / + - | Zoom |
| f or 0 | Fit full time range |
| Tab | Next kernel |
| / | Search kernels |
| n | Toggle NVTX |
| a | AI chat |
| ? | Help overlay |
Timeline TUI
A Perfetto-style horizontal viewer with per-stream kernels, NVTX hierarchy
bars, and a time-cursor navigation model.
| Key | Action |
|---|---|
| arrows | Pan time / select stream |
| Shift+arrows | Page pan (quarter viewport) |
| Tab | Snap to next kernel |
| + - | Zoom |
| / | Filter kernels by name |
| m | Minimum-duration threshold |
| d | Toggle demangled names |
| B | Save bookmark (with kernel + NVTX context) |
| C | Config panel (stream rows, tick density, NVTX depth) |
| h | Full help overlay |
Profile diff