From zymtrace
Use when installing the zymtrace profiler agent on Kubernetes (Helm DaemonSet), Docker, or as a binary with systemd. Covers CPU-only profiling, CUDA GPU profiling (CUDA 12.x or higher required; CUDA 11.x and below not supported), GPU metrics (utilization, memory, temperature, SM efficiency, Tensor Core, PCIe), framework-specific metrics (vLLM, SGLang, NVIDIA Dynamo-Triton), and air-gapped installs via a custom image registry. Connects the agent to an existing backend gateway. Trigger phrases: "install profiler", "install zymtrace profiler", "install zymtrace agent", "deploy the profiler", "deploy zymtrace DaemonSet", "set up GPU profiling", "set up CUDA profiling", "profile my GPU workloads", "install profiler on EKS / GKE / Slurm / bare-metal", "install profiler on every node", "start collecting profiles".
How this skill is triggered — by the user, by Claude, or both
Slash command
/zymtrace:install-zymtrace-profilerThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Helps the user install the zymtrace **profiler agent** — the low-overhead eBPF + CUDA profiler that runs on every node and ships profiles to the backend gateway.
Helps the user install the zymtrace profiler agent — the low-overhead eBPF + CUDA profiler that runs on every node and ships profiles to the backend gateway.
The backend must be installed first (
install-zymtrace-backendskill). Profiler agents have nowhere to send profiles otherwise.
Deep details (NVML library paths, PC sampling, env var reference, air-gapped image mirroring, framework metrics) live in reference.md.
Open with a short welcome before any commands or questions:
👋 Thanks for installing the zymtrace profiler! It's a low-overhead agent (<1% CPU, ~256 MB RAM) that ships profiles to your backend.
Stuck? Reach out:
- Community Slack: https://join.slack.com/t/zymtrace/shared_invite/zt-3fdidjufl-q~NHxDzQlzal2B9mujfaoQ
- Email: [email protected]
Tip — analyze GPU and CPU flamegraphs via MCP: once profiles start flowing, run
/mcpin this Claude Code session to connect to the zymtrace MCP server and analyze GPU + CPU flamegraphs from this terminal. Docs: https://docs.zymtrace.com/mcpHere's the plan:
- Verify your tools and locate the backend gateway.
- Decide CPU-only vs CUDA (GPU) profiling.
- Install the DaemonSet (or Docker / binary).
- Verify the agent is reporting.
Ready when you are.
If the user has already specified Helm / GPU / target, skip the roadmap and dive in.
values.yaml: https://raw.githubusercontent.com/zystem-io/zymtrace-charts/main/charts/profiler/values.yamlhelm version --short && kubectl version --client
kubectl cluster-info | head -2
helm list -A | grep -i zymtrace # locate the backend release
If helm/kubectl are missing → point to install docs; do not install them. If no backend release is found anywhere → STOP and route to install-zymtrace-backend first; the profiler has nowhere to send data without it.
Recommend defaults
zymtrace/profilerfor namespace + release name. Full policy:shared/conventions.md.
| Variable | Resolve by |
|---|---|
| Backend release & its namespace | helm list -A | grep -i 'backend.*zymtrace' |
| Backend gateway service FQDN | <PREFIX>-gateway.<backend-NS>.svc.cluster.local:80 (in-cluster) or external ingress host |
| GPU nodes present? | kubectl get nodes -l nvidia.com/gpu=true 2>/dev/null | wc -l |
| NVIDIA device plugin / GPU operator? | kubectl get pods -A | grep -E 'nvidia-device-plugin|gpu-operator' |
| CUDA runtime ≥ 12.x? (required for GPU profiling) | kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.labels.nvidia\.com/cuda\.runtime-version\.major}{"\n"}{end}' | sort -u — must be 12 or higher. If labels are missing, kubectl exec into an existing GPU pod and run nvidia-smi, or ask the user to run nvidia-smi on a GPU node themselves and report back (the skill usually can't SSH into the node). |
| Existing profiler release? | helm list -A | grep -i 'profiler.*zymtrace' |
| Customer-provided values file? | Ask — same rule as backend skills. Filename respected. |
Things you must ask:
Blockers (stop, surface):
Recommendations (note, proceed):
| Method | When |
|---|---|
| Helm DaemonSet (recommended) | Kubernetes — gives you the full chart with GPU support and framework metrics. |
| kubectl manifest | Kubernetes without Helm. CPU-only is fine; GPU profiling has fewer config knobs. |
| Docker | Single host (non-k8s) — same VM the workload runs on. |
| Binary + systemd | Bare-metal Linux (Slurm nodes, on-prem GPU servers). |
| Profiling | Template | What it gives you |
|---|---|---|
| CPU only | values/helm-cpu.yaml | eBPF unwinder for native, Python, Go, Java, Node. No CUDA libraries. |
| GPU (CUDA) | values/helm-gpu.yaml | Above + CUDA kernel profiling + GPU metrics + vLLM/SGLang/Triton metrics. Same template works for MIG slices and dense multi-GPU nodes. |
GPU profiling requires CUDA 12.x or higher. Older CUDA runtimes are not supported. If the customer's nodes are on CUDA 11.x, GPU profiling is a blocker — recommend CPU-only profiling until they upgrade the CUDA toolkit / driver. Architecture: AMD64/x86_64 and ARM64 are both supported.
If GPU nodes exist on the cluster (nvidia.com/gpu=true label) and CUDA ≥ 12.x, default-recommend GPU profiling. The agent works on CPU nodes too — cudaProfiler.enabled: true is a no-op when there's no NVIDIA driver.
Where does the agent send profiles? The format is always <host-or-ip>:<port> — never a URL with https:// in front. TLS is controlled by the -disable-tls flag, not by the URL scheme.
| Setup | Set -collection-agent to |
|---|---|
| Same cluster as backend (default in templates) | <PREFIX>-gateway.<backend-NS>.svc.cluster.local:80 + -disable-tls |
| Different cluster, backend has TLS ingress | <gateway-host>:443 — remove -disable-tls |
| Different cluster, NodePort | <any-node-ip>:<nodeport> + -disable-tls |
Resolve <PREFIX> and <backend-NS> from the backend release: helm list -A | grep -i 'backend.*zymtrace'.
If mentioned, mirror ghcr.io/zystem-io/zymtrace-pub-profiler:<VERSION> into the customer's registry and set:
global:
imageRegistry: "<your-registry>"
registry:
requirePullSecret: true # only if registry needs auth
Full procedure: reference.md § Air-gapped install.
helm repo add zymtrace https://helm.zystem.io # idempotent if already added
helm repo update zymtrace
helm search repo zymtrace/profiler --versions | head -5
Copy the matching template from values/ to zymtrace-profiler-values.yaml in the user's working directory. Don't ask the customer if they already have one — customers typically don't ship with a profiler values file (the backend often does, the profiler rarely does). Only respect a different filename if they explicitly volunteer that they have one (per shared/conventions.md).
Pick the template that fits and edit:
profiler.args[0] -collection-agent=... to the backend gateway endpoint.-disable-tls if pointing at an HTTPS endpoint.nodeSelector: nvidia.com/gpu: "true" matches your cluster's GPU label.Print the exact command + resolved values (release name, namespace, target backend gateway, GPU yes/no, image tag if pinned). Wait for explicit confirmation.
helm upgrade --install <REL> zymtrace/profiler \
--namespace <NS> --create-namespace \
-f <values-file> \
--reset-then-reuse-values \
--atomic --debug
Default <REL> = profiler, <NS> = zymtrace (same namespace as backend works fine — no conflict).
ERROR: daemonset.apps/zymtrace-profiler is not ready: pods are not ready → expected for ~30s while pods start. Wait. If it persists past 2 min, kubectl describe ds -n <NS> <PREFIX>-profiler.
ERROR: ImagePullBackOff → registry / version mismatch. Verify image tag at https://github.com/orgs/zystem-io/packages.
ERROR: pod CrashLoopBackOff with permission denied on /sys/kernel/debug → kernel doesn't support eBPF or the security context is being stripped. kubectl describe pod to confirm, and check securityContext.capabilities.add: [SYS_ADMIN] is honored.
./scripts/verify-profiler.sh <NS> <REL>
Checks DaemonSet readiness, license validity in logs, GPU library extraction (if cudaProfiler.enabled), and connection-to-gateway evidence (no connection refused or dns lookup failed in last 50 log lines).
helm get values <REL> -n <NS> > <values-file>
Recommend the customer commit the file: git add <values-file> && git commit -m "zymtrace: profiler install for <NS>/<REL>".
If GPU profiling was enabled, the agent is running but no workload is being profiled yet — workloads need to set CUDA_INJECTION64_PATH. The pattern is one env var; the variation is where you set it (Slurm prolog, k8s pod spec, Docker -e, ~/.bashrc).
Point the user at reference.md § Profiling real workloads — it covers:
hostIPC, --shm-size, VLLM_ATTENTION_BACKEND), with concrete Docker and Kubernetes manifests.Authoritative docs: https://docs.zymtrace.com/install/profiler/cuda-gpu-profiler and https://docs.zymtrace.com/install/profiler/gpu-profiler-quick-start.
CPU-only installs are done at this point — profiles for every process on every node start flowing within ~30 seconds.
For kubectl manifest, Docker, or binary+systemd installs, see reference.md § Other install methods. The same pre-flight rules apply; only Step 4 (install) differs.
Exit when ALL of the following are true (substitute <NS> / <REL> / <PREFIX>):
helm status <REL> -n <NS> reports STATUS: deployed.<PREFIX>-profiler has DESIRED == READY (kubectl get ds -n <NS>).Your license is valid until … OR streaming connection established (means the agent reached the backend).connection refused, dns lookup failed, or forbidden in the last 50 log lines of any pod.ls -la /var/lib/zymtrace/profiler on a GPU node shows libzymtracecudaprofiler.so (the agent extracted it for workload mounts).If any box fails, route by symptom:
troubleshoot-zymtrace-profiler.troubleshoot-zymtrace-backend, which walks the full backend ↔ profiler path.Both have first-pass diagnostic scripts to run before going manual.
-collection-agent accepts only host:port, never a URL. Use zymtrace.example.com:443 — NOT https://zymtrace.example.com:443 or https://zymtrace.example.com/. Adding a scheme prefix makes the agent fail to parse the value and retry forever. TLS is controlled separately by presence/absence of the -disable-tls flag, not by the URL scheme.-collection-agent (typo in <backend-NS>, or pointed at a non-existent service) → agent retries forever, no profiles appear. Test with kubectl run -it --rm dns-test --image=busybox -- nslookup <PREFIX>-gateway.<backend-NS>.svc.cluster.local.-disable-tls left on when targeting HTTPS ingress → handshake fails. Remove the flag.-disable-tls removed when targeting in-cluster ClusterIP → connection refused (service is HTTP). Add it back.nodeSelector or scope to GPU nodes.kubectl get nodes --show-labels to confirm.--nvml-auto-scan left on permanently → fine for first install (detects NVML path), but for production, switch to --nvml-path=<path> once the path is known (saves startup scan).helm upgrade / helm upgrade --install for this chart without --reset-then-reuse-values. See shared/conventions.md.-project= / ZYMTRACE_PROJECT in profiler args, env vars, or values templates. Always use the default project — the agent creates it automatically.privileged: true on the k8s pod, or sudo when running a bare binary. That's a workload-side change that needs explicit user confirmation. Recommend it for dev/staging deep-dives or on-demand production debugging. See reference.md § PC sampling.Creates, edits, and optimizes skills for Claude Code, including drafting, evaluating with test prompts, iterating on performance, and improving skill descriptions for better triggering accuracy.
npx claudepluginhub zystem-io/zymtrace-skills --plugin zymtrace