From qa-load-testing
Reads CPU flame-graph output from py-spy (Python), async-profiler (JVM), Go pprof, or Node.js perf_hooks / clinic.js - identifies the hot path (top sample-time stack frames), classifies the bottleneck (CPU-bound vs lock contention vs allocator pressure), and proposes the next investigation step. Use when a perf regression has been bisected to a commit but the hot path inside that commit is unclear.
Slash command: /qa-load-testing:flame-graph-analyzer
Canonical flame-graph reference: brendan-gregg-flame. Widest leaf = hot path.
Different runtimes produce flame graphs from different profilers:
| Runtime | Profiler |
|---|---|
| Python | py-spy |
| JVM | async-profiler |
| Go | Go's built-in pprof (runtime/pprof) |
| Node.js | clinic.js flame (Clinic's bundled profiler) |
| Native (C/C++/Rust) | perf / dtrace / Linux's eBPF tools |
This skill is language-agnostic - it consumes the flame graph
output (SVG, JSON, or folded-stacks .txt) and surfaces a
hypothesis the engineer can act on.
Use this skill when:
- A perf regression has been bisected to a commit (e.g. by perf-regression-bisector),
  but the introducing commit touches multiple functions; the team
  needs to know which function is the actual hot path.
- A load test (k6-load-testing or sibling) shows latency growth, but the API code hasn't visibly
  changed - flame graph reveals the runtime cause.
- An EXPLAIN ANALYZE trace from a SQL query suggests CPU is the
  bottleneck rather than I/O - flame graph confirms.

For each runtime, the canonical capture command:
Python (py-spy):
py-spy record -o flame.svg -d 30 --pid <pid>
# OR run-and-record
py-spy record -o flame.svg -d 30 -- python app.py
Output: an SVG flame graph by default; pass --format raw to write folded stacks (.txt) instead.
JVM (async-profiler):
# The .html extension selects flame-graph output; timeout stops the profiler after 30 s
java -agentpath:/path/to/libasyncProfiler.so=start,event=cpu,timeout=30s,file=flame.html ...
# OR record JFR for later conversion
java -agentpath:async-profiler/build/libasyncProfiler.so=start,event=cpu,file=profile.jfr ...
Output: HTML flame graph or JFR (Java Flight Recorder) format.
Go (pprof):
# In-process: import _ "net/http/pprof" + http.ListenAndServe(":6060", nil)
# Quote the URL so the shell does not treat "?" as a glob character
go tool pprof -http=:8080 "http://localhost:6060/debug/pprof/profile?seconds=30"
Open the served URL → "VIEW → Flame Graph".
Node.js (clinic.js):
npx clinic flame -- node app.js
# Generates an HTML flame graph when the process exits.
All major profilers can emit "folded stacks" - one line per unique stack with its sample count:
main;handleRequest;serializeJson;Buffer.from 4521
main;handleRequest;dbQuery;parseRows 1832
main;handleRequest;authCheck;jwtVerify 904
Brendan Gregg's flamegraph.pl consumes this format directly.
Folded-stacks is the canonical machine-readable form for this
skill's analysis.
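As an illustration of how little structure the format has, here is a minimal parsing sketch. The function name `parse_folded` is hypothetical (it is not part of any profiler's API), and it assumes the sample count is always the last space-separated field on a line, which lets frame names contain spaces.

```python
from typing import List, Tuple


def parse_folded(text: str) -> List[Tuple[List[str], int]]:
    """Parse folded-stack text into (frames, sample_count) pairs."""
    stacks = []
    for line in text.splitlines():
        # Split on the LAST space: everything before it is the stack.
        stack, _, count = line.strip().rpartition(" ")
        if not stack or not count.isdigit():
            continue  # skip blank or malformed lines
        stacks.append((stack.split(";"), int(count)))
    return stacks


sample = """\
main;handleRequest;serializeJson;Buffer.from 4521
main;handleRequest;dbQuery;parseRows 1832
main;handleRequest;authCheck;jwtVerify 904
"""

for frames, count in parse_folded(sample):
    # frames[0] is the root, frames[-1] is the leaf
    print(frames[-1], count)
```

Keeping the frames as a list (root first, leaf last) makes both leaf-level and whole-stack questions easy to answer later.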
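Read the folded stacks (or extract them from SVG / JSON), sort by sample count, and identify the top 5 leaves. The leaf is the last frame on each folded line: the code actually running when the sample was taken, not the framework wrappers above it.
# Top 5 stacks by sample count (the count is the last field, so frame names may contain spaces)
awk '{print $NF "\t" $0}' folded.txt | sort -rn | head -5 | cut -f2-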
Example output:
main;handleRequest;serializeJson;JSON.stringify 4521
main;handleRequest;dbQuery;Array.from 2103
main;handleRequest;dbQuery;parseRows 1832
main;authCheck;jwt.verify;crypto.createHash 904
main;handleRequest;serializeJson;Buffer.from 312
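Sorting whole stacks can split one leaf across several call paths. The sketch below instead sums samples per leaf and reports each leaf's share. The helper name `top_leaves` is hypothetical, and the shares are relative only to the lines supplied; a real profile has many more stacks, so the percentages here will not match a report computed over the full capture.

```python
from collections import Counter


def top_leaves(folded_text, n=5):
    """Return (leaf, samples, share) for the n leaves with the most samples."""
    leaf_samples = Counter()
    for line in folded_text.splitlines():
        # The sample count is the last space-separated field on the line.
        stack, _, count = line.strip().rpartition(" ")
        if not stack or not count.isdigit():
            continue  # skip blank or malformed lines
        leaf = stack.split(";")[-1]  # last frame = code running at sample time
        leaf_samples[leaf] += int(count)
    total = sum(leaf_samples.values())
    return [(leaf, c, c / total) for leaf, c in leaf_samples.most_common(n)]


folded = """\
main;handleRequest;serializeJson;JSON.stringify 4521
main;handleRequest;dbQuery;Array.from 2103
main;handleRequest;dbQuery;parseRows 1832
main;authCheck;jwt.verify;crypto.createHash 904
main;handleRequest;serializeJson;Buffer.from 312
"""

for leaf, samples, share in top_leaves(folded):
    print(f"{leaf:20s} {samples:6d} {share:6.1%}")
```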
For each hot path, the sample-time signature points to a category:
| Category | Signature in flame graph |
|---|---|
| CPU-bound (hot algo) | A wide leaf in user code (e.g. a regex, a JSON serializer, a hash function). |
| Allocator pressure | Wide GC frames (gc::scavenge, Java GC, Python's gc.collect). |
| Lock contention | Wide synchronization frames (pthread_mutex_lock, Object.wait, parking). |
| I/O wait misclassified | If the profiler is on-CPU only, I/O blocks won't appear. Switch to wall-clock profiling. |
| Reflection / dynamic dispatch overhead | Wide reflection.invoke, method_missing, getattr chains. |
| Logging overhead | Wide log.format, Logger.debug, JSON serialization for log lines. |
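A first-pass classifier can be sketched as a list of regular expressions built from the frame names in the table above. The patterns and the `classify` helper are illustrative assumptions, not an exhaustive or authoritative rule set; anything that matches no pattern falls back to "CPU-bound (hot algo)" and still needs a human look.

```python
import re

# (category, pattern) pairs; the first match wins. Patterns are illustrative only.
CATEGORY_PATTERNS = [
    ("Lock contention",
     re.compile(r"pthread_mutex_lock|Object\.wait|\bpark(ing)?\b")),
    ("Allocator pressure",
     re.compile(r"gc::scavenge|gc\.collect|\bGC\b")),
    ("Reflection / dynamic dispatch overhead",
     re.compile(r"reflection\.invoke|Method\.invoke|method_missing|getattr")),
    ("Logging overhead",
     re.compile(r"log\.format|Logger\.debug")),
]


def classify(frame: str) -> str:
    """Map a leaf frame name to a bottleneck category."""
    for category, pattern in CATEGORY_PATTERNS:
        if pattern.search(frame):
            return category
    # No runtime-level signature matched: treat as user-code CPU work.
    return "CPU-bound (hot algo)"


for frame in ["pthread_mutex_lock", "gc.collect", "Method.invoke", "JSON.stringify"]:
    print(f"{frame:20s} -> {classify(frame)}")
```

The I/O-wait row is deliberately absent: an on-CPU profile contains no frames to match, so that case has to be detected by comparing wall-clock and CPU time instead.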
Map category → typical fix:
| Category | Typical fix |
|---|---|
| CPU-bound hot algo | Cache the result; switch to a faster algorithm; move out of the hot path. |
| Allocator pressure | Reuse buffers / pools; switch to streaming serialization; escape-analysis fixes for the JVM. |
| Lock contention | Reduce critical-section scope; move to lock-free data structures; per-shard locking. |
| Reflection overhead | Replace dynamic dispatch with cached call-sites or codegen. |
| Logging overhead | Lazy log message construction; level-check before format. |
## Flame graph analysis — `<profile-source>`
**Runtime:** python | jvm | go | node | native
**Profile duration:** Ns (or per-request)
**Top sampled paths:**
| Rank | Sample share | Stack (leaf) | Category |
|-----:|-------------:|--------------|----------|
| 1 | 38% | `JSON.stringify` (in `serializeJson`) | CPU-bound hot algo |
| 2 | 17% | `Array.from` (in `dbQuery`) | Allocator pressure |
| 3 | 15% | `parseRows` (in `dbQuery`) | CPU-bound hot algo |
| 4 | 8% | `crypto.createHash` (in `jwt.verify`) | CPU-bound hot algo (crypto) |
| 5 | 3% | `Buffer.from` (in `serializeJson`) | Allocator pressure |
### Hypothesis
The top hot path (`JSON.stringify` at 38% sample share) is the
dominant cost. Rank 5 (`Buffer.from` at 3%) sits on the same
serialization path, so the serialize step accounts for ~41% of
sampled time combined.
### Recommended next step
1. **Switch to a faster JSON serializer** (e.g. schema-compiled `fast-json-stringify`
in Node, `orjson` in Python, Jackson's streaming `JsonGenerator` in JVM)
to cut intermediate string allocation; speedups of ~2-5x are plausible on
benchmark-typical payloads but depend on payload shape, so measure on this workload.
2. Re-profile after the change; expect rank 1 to drop below 10%.
3. Hand off to [`perf-budget-gate`](../perf-budget-gate/SKILL.md)
to confirm the regression delta closes.
| Rank | Share | Stack (leaf) | Category |
|-----:|------:|-------------------------------|----------|
| 1 | 32% | `gc.collect` | Allocator pressure |
| 2 | 18% | `dict.update` | (callsite) |
| 3 | 14% | `parse_response` | CPU-bound hot algo |
GC at 32% of samples → allocator pressure dominates. The fix isn't
making any one function faster - it's reducing the rate of
allocations from dict.update and parse_response (object pooling,
streaming parsing).
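GC work does not always sit at the leaf; collector frames can have children of their own. A rough way to measure total GC cost is an inclusive share: the fraction of samples whose stack contains a matching frame anywhere. The sketch below assumes folded-stack input; the `inclusive_share` helper and the sample stacks are hypothetical, made up to mirror the 32% figure above.

```python
def inclusive_share(folded_text: str, needle: str) -> float:
    """Fraction of samples whose stack contains a frame matching `needle`."""
    matched = total = 0
    for line in folded_text.splitlines():
        stack, _, count = line.strip().rpartition(" ")
        if not stack or not count.isdigit():
            continue  # skip blank or malformed lines
        samples = int(count)
        total += samples
        # Count the whole stack once if ANY frame matches, leaf or not.
        if any(needle in frame for frame in stack.split(";")):
            matched += samples
    return matched / total if total else 0.0


# Hypothetical profile: 1000 samples, 320 of them under gc.collect.
profile = """\
main;handle;parse_response;gc.collect 200
main;handle;dict.update;gc.collect;visit 120
main;handle;parse_response 140
main;handle;dict.update 180
main;handle;send 360
"""

print(f"{inclusive_share(profile, 'gc.collect'):.0%}")  # prints 32%
```

The same stacks also show which callers trigger the collections (`parse_response` and `dict.update` here), which is where the allocation rate has to come down.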
| Rank | Share | Stack (leaf) | Category |
|-----:|------:|-----------------------------------------|----------|
| 1 | 41% | `pthread_mutex_lock` | Lock contention |
| 2 | 12% | `cache.get` | (callsite) |
41% of samples in lock acquisition. The fix is not "make cache.get
faster" - it's "reduce the contention" (per-shard locks, lock-free
structures, or a concurrency-optimized cache like Caffeine for the JVM).
| Rank | Share | Stack (leaf) | Category |
|-----:|------:|---------------------------------|----------|
| 1 | 28% | `Method.invoke` / `getattr` | Reflection overhead |
A common surprise - an ORM's reflective field access dominates the
profile of an otherwise simple endpoint. The fix is the
ORM-equivalent of "compile the mapping" - cached method handles in
the JVM, __slots__ in Python, generated SQL in Go.
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| Reading the SVG visually only, no quantitative data | Easy to mis-judge widths; biases toward dramatic-looking deep stacks. | Always work from folded stacks; sort by sample count. |
| Profiling under too little load | One request / second can't expose contention or allocator pressure. | Profile under realistic load - pair with k6-load-testing. |
| Optimizing rank 5 first because rank 1 looks "structural" | Premature optimization; misses the dominant cost. | Always start with rank 1; only descend if rank 1 is genuinely framework-bound (e.g. event_loop). |
| On-CPU profiler for an I/O-bound workload | I/O wait doesn't appear; flame graph shows what's running, not what's waiting. | Use wall-clock / off-CPU profiling for I/O-bound workloads. |
| Single 30-second capture under highly variable load | Sample is unrepresentative. | Capture multiple samples across the load-test duration; merge. |
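Merging several captures is straightforward in folded-stack form, because identical stacks can simply have their counts summed. A minimal sketch follows, with a hypothetical `merge_folded` helper. It assumes all captures come from the same profiler at the same sampling rate; otherwise the counts are not comparable.

```python
from collections import Counter


def merge_folded(*captures: str) -> str:
    """Sum sample counts per unique stack across several folded-stack captures."""
    merged = Counter()
    for text in captures:
        for line in text.splitlines():
            stack, _, count = line.strip().rpartition(" ")
            if stack and count.isdigit():
                merged[stack] += int(count)
    # Emit folded-stack text again, highest count first.
    return "\n".join(f"{stack} {count}" for stack, count in merged.most_common())


early = "main;handle;serialize 30\nmain;handle;dbQuery 10\n"
late = "main;handle;dbQuery 50\nmain;handle;auth 5\n"

print(merge_folded(early, late))
```

The result is itself valid folded-stack text, so it can be fed to the same ranking step or to flamegraph.pl.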
- perf-regression-bisector - upstream agent that bisects to a commit; this skill picks up inside the commit.
- k6-load-testing - runner that produces the load under which the profile is captured.

Install: npx claudepluginhub testland/qa --plugin qa-load-testing