From octoperf
Use when an OctoPerf load-test scenario has completed (or is running) and the user wants to understand why it failed, underperformed, or behaved unexpectedly. Triggers on "the load test failed", "why are response times so high", "high error rate in the scenario", "diagnose this bench", "the run looks bad". Walks the LLM through reading global metrics, narrowing scope, comparing against validation, and surfacing the right next step (re-validate, tune scenario, fix infra). Requires the OctoPerf MCP server and a `benchResultId` to investigate.
Slash command: `/octoperf:octoperf-scenario-diagnosis`
A scenario run produced metrics that look bad — high error rate, high response times, low throughput, premature stop. This skill walks the diagnosis: read metrics → narrow down → match the symptom to one of four root-cause classes → surface the right fix.
You need a benchResultId from one of:
- `mcp__octoperf__run_scenario(scenarioId)`.
- `mcp__octoperf__list_bench_reports_by_project(projectId)` filtered on `benchResultIds` for the UI deep-link.

If `mcp__octoperf__get_bench_status(benchResultId)` shows progress < 1.0, the test is still running. Either wait and re-check, or surface what has been measured so far with the caveat that it may change.
Before reading metrics, confirm the run actually produced samples.
run_scenario can fail before any HTTP traffic is generated —
infrastructure error, no matching plan, deserialisation issue,
configuration rejected. A diagnosis built on metrics from a run that
never started will mislead the user.
mcp__octoperf__get_bench_result(benchResultId)
The exhaustive state machine is CREATED → PENDING → SCALING → PREPARING → INITIALIZING → (ERROR | RUNNING) → (FINISHED | ABORTED).
Any other label is a transport / UI artefact.
- state = FINISHED → proceed to step 1.
- state = ABORTED → either manual stop or stall-abort; jump to the jmeter.log signature catalogue.
- state = ERROR → the run errored during provisioning or startup, so there are no samples to read. Pull the orchestration logs:
mcp__octoperf__list_bench_docker_logs(benchResultId)
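The state check above can be sketched as a small routing function. This is only an illustration: `next_step` and the returned strings are hypothetical names, and only the state labels come from the state machine listed above.

```python
# Hypothetical helper: only the state labels are taken from the text above.
NOT_STARTED_STATES = {"CREATED", "PENDING", "SCALING", "PREPARING", "INITIALIZING"}


def next_step(state: str) -> str:
    """Map a bench result state to the next diagnosis step."""
    if state == "FINISHED":
        return "read global metrics (step 1)"
    if state == "ABORTED":
        return "check the jmeter.log signature catalogue"
    if state == "ERROR":
        return "pull orchestration logs with list_bench_docker_logs"
    if state == "RUNNING":
        return "still running: wait, or report preliminary metrics"
    if state in NOT_STARTED_STATES:
        return "not started yet: wait and re-check"
    # Anything outside the documented state machine is not a real state.
    return "unknown label: treat as a transport / UI artefact"
```

Keeping the fallback branch explicit makes it harder to build a diagnosis on a label that is not part of the state machine.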
Common ERROR-state causes:
- No matching plan: `get_scenario_matching_plans(scenarioId)` returns an empty result and `list_active_subscriptions()` lists the caps. The binding cap is usually `maxRealBrowserUsers=0` on basic plans rejecting a Playwright UserProfile, or `maxProfilesPerScenario` rejecting a multi-VU hybrid.
- VU-level failure: hand off to `octoperf-validation-triage`.

mcp__octoperf__list_bench_reports_by_project(projectId)
# pick the report tied to your benchResultId, then
mcp__octoperf__get_bench_report(reportId)
# locate the SummaryReportItem in the returned items list, then
mcp__octoperf__get_report_summary_values(reportId, summaryItemId)
The default report's SummaryReportItem aggregates the test-wide values: average response time, percentiles (p50/p90/p95/p99), hits per second, total error rate, error count by type, total transactions, throughput. Don't dive into per-action data yet — the global view tells you which class of problem you're in.
Trust caveat — load-generator overload. Before reading any response
time, check whether the bench report has a MonitoringAlarmsReportItem
firing on the load generators (CPU / memory / load average). If it
fires, the response times in the report are underestimated: the
load generator itself was the bottleneck, and JMeter's internal timing
becomes unreliable. Surface this as a confidence caveat ("response
times are suspect — LG was overloaded") before drawing conclusions,
and suggest re-running on the cloud or on a larger LG.
Trust caveat — cache hits skew the global numbers. JMeter's
CacheManager is enabled by default. When a recorded VU hits the
same URL repeatedly (typical on a session that revisits pages), the
server returns HTTP 304 Not Modified and JMeter records the sample —
but the response time / throughput then reflect a cache check, not
real load on the SUT. If get_report_pie_values on the response-codes
widget shows more than ~40% 304s, flag it: the visible numbers are an
optimistic floor, the real SUT cost lives in the 200 samples. To
diagnose the SUT, filter to status=200 when drilling into per-action
metrics.
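The 304 check can be sketched as follows. The input shape is an assumption: a plain mapping from response code to sample count, which you would build yourself from the response-codes pie values. The function names are hypothetical; the ~40% threshold is the one quoted above.

```python
def cache_hit_share(code_counts: dict) -> float:
    """Fraction of samples that came back as HTTP 304 Not Modified."""
    total = sum(code_counts.values())
    if total == 0:
        return 0.0
    return code_counts.get("304", 0) / total


def needs_cache_caveat(code_counts: dict, threshold: float = 0.40) -> bool:
    """True when 304s dominate enough to make the global numbers optimistic."""
    return cache_hit_share(code_counts) > threshold


# Made-up counts, for illustration only.
sample = {"200": 500, "304": 450, "500": 50}
flagged = needs_cache_caveat(sample)
```

When the caveat applies, read the per-action metrics filtered to status=200 rather than the global averages.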
Trust caveat — fail-fast peaks. When the response-codes pie shows a peak of errors correlated with the hit-rate peak, the server is failing fast (errors return short, cheap responses). The apparent throughput spike is illusory — read the error rate before the hit rate when a chart shows a sudden bump.
LG monitoring caveats.
- Check G1 Old / `collectionCount` on the LG-JVMs widget before recommending more LGs.
- `%UsedMemory` alerts essentially never fire (OctoPerf pre-provisions). When they fire on an on-prem agent, another process on the host is the cause; the JVM alone won't trigger it.
- An empty `LoadGeneratorsChartReportItem` (hosts) on an on-prem run usually means IP Spoofing is enabled on that LG, which disables agent monitoring entirely. It does not mean "no data".

OctoPerf ships an `InsightsReportItem` in the default report: call `get_report_insights` and let the platform classify the run for you. One call returns up to ~15 insights tagged by severity (ERROR / WARN / INFO / PASSED) with the heuristic's numeric value. This is the fastest path to a classification and often skips the manual table lookup in step 2.
mcp__octoperf__get_report_insights(reportId, insightsItemId)
Mapping of common InsightIds to root-cause classes (look at the
ones tagged ERROR or WARN first):
| InsightId | Severity at fire | What it means | Where to look next |
|---|---|---|---|
| `RESPONSE_TIME_GLOBAL_AVG` | INFO/WARN/ERROR | RT drifts from the test's own average. Severity scales with deviation | `get_report_area_range_values` on the linked inspect widget |
| `RESPONSE_TIME_STD_DEVIATION` | INFO/WARN/ERROR | Wide spread between p50 and p99: user experience is inconsistent | `get_report_line_chart_values` on a `PercentilesChartReportItem` |
| `STEP_BY_STEP_RESPONSE_TIME` | INFO/WARN/ERROR | One or two actions much slower than the rest: bottleneck on a specific endpoint | `get_report_top_values` on Top Response Times |
| `CONNECT_TIME_VS_RESPONSE_TIME` | INFO/WARN/ERROR | Connect time is a large fraction of RT → TLS handshake / no keep-alive / connection pool small | Check HTTP server config: enable keep-alive, increase resourcesPool |
| `LATENCY_VS_RESPONSE_TIME` | INFO/WARN/ERROR | Latency (server time to first byte) is the bulk of RT → server-side processing bottleneck | Surface to user: check SUT thread pool, DB, GC |
| `HIT_RATE_INFLEXION_POINT` | WARN/ERROR | Hits/s plateaus before the VU count plateaus: SUT reached a soft cap mid-ramp | Confirm with VU-vs-hits chart; the inflexion x-axis is the saturation point |
| `STEP_BY_STEP_ERRORS` | INFO/WARN/ERROR | Errors concentrated on a few actions | `get_report_top_values` on Top Error Percentages; then drill via `get_report_errors` |
| `PEAK_OF_ERRORS` | INFO/WARN/ERROR | Errors spike at one moment (e.g. ramp inflexion) | `get_report_line_chart_values` (USERLOAD + ERRORS_RATE): what changed at that time? |
| `OVERALL_ERROR_4XX/5XX/NONE` | INFO/WARN/ERROR | Global error rate per family | `get_report_pie_values` on the response-codes pie |
| `THROUGHPUT_IMAGE_NEW_FORMAT` | INFO | Old image formats (JPEG/GIF) dominate | Optimisation hint, not a perf bottleneck |
| `THROUGHPUT_IMAGE_OPTIMIZE` / `THROUGHPUT_CSS` / `THROUGHPUT_JAVASCRIPT` | INFO | Bandwidth eaten by un-minified or un-compressed static assets | Optimisation hint |
| `THRESHOLD_ALARM` | (varies) | A user-configured `ThresholdAlarmReportItem` fired (e.g. SLA breached) | The widget's metric tells you which monitor crossed which threshold |
Each Insight carries a more widget (the visual context) and
sometimes an inspect widget (the drill-down comparison —
typically an AreaRangeChartReportItem). Use the matching
get_report_*_values on those to surface evidence for the user.
Note that the same numeric value that flagged the insight is
also the rmse field of the AreaRangeChartReportItem linked from
inspect — they're literally the same heuristic, exposed twice.
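Reading the insights "ERROR or WARN first" can be sketched like this. The dict shape (`id` and `severity` keys) is an assumption about how you hold the result of `get_report_insights` in memory, not the documented response format. The lookup table copies a few rows of the mapping above.

```python
SEVERITY_ORDER = {"ERROR": 0, "WARN": 1, "INFO": 2, "PASSED": 3}

# Subset of the mapping table above: InsightId -> where to look next.
NEXT_LOOK = {
    "STEP_BY_STEP_RESPONSE_TIME": "get_report_top_values on Top Response Times",
    "CONNECT_TIME_VS_RESPONSE_TIME": "HTTP server config: keep-alive, resourcesPool",
    "LATENCY_VS_RESPONSE_TIME": "SUT thread pool, DB, GC",
    "HIT_RATE_INFLEXION_POINT": "VU-vs-hits chart",
    "STEP_BY_STEP_ERRORS": "get_report_top_values on Top Error Percentages",
    "PEAK_OF_ERRORS": "get_report_line_chart_values (USERLOAD + ERRORS_RATE)",
}


def triage(insights):
    """Keep ERROR/WARN insights, most severe first, with a next-look hint."""
    actionable = [i for i in insights if i["severity"] in ("ERROR", "WARN")]
    actionable.sort(key=lambda i: SEVERITY_ORDER[i["severity"]])
    return [
        (i["id"], i["severity"], NEXT_LOOK.get(i["id"], "see the mapping table"))
        for i in actionable
    ]
```

The sort is stable, so insights of equal severity keep the order in which the platform returned them.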
Match the metrics against one of these patterns:
| Pattern | Likely class | Where to look next |
|---|---|---|
| High error rate (>5%), low load | Functional regression | Validation skill — the VU itself is broken; don't analyze perf |
| High error rate (>5%), high load | System under stress | Server-side capacity / config — surface to user, OctoPerf-side fix is rare |
| Low errors, p95 climbing with load | Bottleneck (application) | Per-action / per-server metrics to identify the slow request |
| Low errors, p95 flat-high, `CONNECT_TIME_VS_RESPONSE_TIME` fires | Bottleneck (infra: TLS / keep-alive) | Check the HTTP server config: enable keep-alive, increase resourcesPool. Symptom: connect time = 40%+ of RT |
| Low errors, p95 flat-high, `LATENCY_VS_RESPONSE_TIME` fires | Bottleneck (server-side processing) | SUT-side: thread pool, DB pool, GC. Surface to user, since this is outside OctoPerf control |
| Hits/s plateaus before VU plateau (`HIT_RATE_INFLEXION_POINT` WARN+) | Soft cap (knee point) | The SUT saturates mid-ramp: capacity is below the configured load. Re-run at the inflexion x-axis VU count to confirm the knee |
| Low errors, flat p95, low throughput | Scenario misconfigured | User-load profile (list_scenarios_by_project → scenario detail in UI) |
| Errors concentrated on one action | Specific endpoint broken | Re-validate the VU; that action probably already fails functionally |
| Errors only at start of run | Warmup / cache cold | Surface to user; rerun with warmup or longer test |
| Errors only at end of run | Resource exhaustion | Memory / connection / DB-pool — server side |
| Recurring per-minute error pulses | Synchronicity artefact (not a VU bug) | Fixed think-times → bursts of same-second requests, or an infra cron (firewall / WAF heartbeat). Randomise thinktime to disambiguate before assuming auth/state issues |
| Test stopped early | Killed or planned stop | jmeter.log signature distinguishes which — see jmeter.log signature catalogue |
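The first three rows of the pattern table can be sketched as a decision function. This is a simplification under stated assumptions: `classify` is a hypothetical name, whether the load counts as "high" is left to the caller, and only the 5% error threshold comes from the table.

```python
def classify(error_rate: float, high_load: bool, p95_climbs_with_load: bool) -> str:
    """Return the likely class for the first three rows of the pattern table."""
    if error_rate > 0.05:
        # Same error level, different meaning depending on the load applied.
        return "system under stress" if high_load else "functional regression"
    if p95_climbs_with_load:
        return "bottleneck (application)"
    return "no match: check the remaining patterns"
```

The remaining rows depend on insights, timing of errors, or log signatures, so they are better matched by hand against the table.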
Smoke-vs-load heuristic. If a low-VU smoke run (1 user, 10-20 iterations) exists for the same scenario / VU, compare the per-action error rates between the smoke run and the load run.
If no smoke baseline exists, propose creating one via
create_scenario_ramp_up (users=1, rampUpSec=0) before running at full
scale — that's an order of magnitude cheaper than diagnosing failures
post-hoc.
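A smoke-versus-load comparison might look like the sketch below. The inputs are assumed to be mappings from action name to error rate that you assemble from the two reports. The verdict wording is one reading of the heuristic (an action that already fails at 1 user is a functional problem; one that fails only at scale is load-induced), and the 5% cut-off is borrowed from the pattern table, not prescribed for this comparison.

```python
def compare_error_rates(smoke: dict, load: dict, threshold: float = 0.05) -> dict:
    """Label each action by where its error rate exceeds the threshold."""
    verdicts = {}
    for action, load_rate in load.items():
        # Actions missing from the smoke run are treated as error-free there.
        smoke_rate = smoke.get(action, 0.0)
        if smoke_rate > threshold:
            verdicts[action] = "fails at 1 user: functional"
        elif load_rate > threshold:
            verdicts[action] = "fails only under load"
        else:
            verdicts[action] = "ok"
    return verdicts
```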
If you classified the run as functional regression or "specific endpoint broken", don't try to fix it from metrics. Validation has the HTTP-level detail you need.
mcp__octoperf__list_virtual_users(projectId)
# Identify the VU(s) used by the scenario from the scenario detail
mcp__octoperf__get_virtual_user_validation(projectId, virtualUserId)
If the latest validation is also failing, the VU is broken regardless of load — hand off to the validation-triage skill. If validation is clean, the breakage emerged under concurrent load (race condition, test data exhausted, rate limit hit) — that's a finding to surface to the user, not a VU-edit task.
End with a clear summary the user can act on:
- The values from `get_report_summary_values` / `get_report_*_values` that back the verdict.
- A next step, for example: "… `validate_virtual_user`, validation is failing too."

For anything beyond high-level diagnosis (per-percentile graphs, per-monitor metrics, custom dashboards) the OctoPerf UI report is far better than another tool call:
mcp__octoperf__list_bench_reports_by_project(projectId)
Filter the result on benchResultIds to find the reports tied to this
run and render their url as Markdown links so the user can open them.
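Filtering and rendering the links can be sketched as below. The `benchResultIds` and `url` fields are the ones named above; the `name` field and the list-of-dicts shape are assumptions made for illustration.

```python
def report_links(reports, bench_result_id):
    """Markdown links for every report tied to the given bench result."""
    return [
        f"[{r['name']}]({r['url']})"
        for r in reports
        if bench_result_id in r.get("benchResultIds", [])
    ]
```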
When a scenario has multiple UserProfiles (e.g. N×JMeter for load +
1×Playwright probe — see octoperf-real-browser-probe), the default
report's StatisticTableReportItem aggregates across all of them.
Use the tree variant to split by VU:
mcp__octoperf__get_report_tree_values(reportId, statisticTreeItemId)
Each TreeEntry carries a virtualUserId — group by it before
reading. Two important caveats when reading a hybrid run:
1. Don't compare Playwright vs JMeter per-action timings naïvely. For the same target URL:
   - Playwright: `page.goto` duration including JS exec + render, or just `page.click` time (the click + paint; the resulting navigation is separate).

   The numbers reflect different things. The best apples-to-apples comparison is the first homepage visit (Playwright's `page.goto('/')` order 9 vs JMeter's `GET /` parent), and even then the Playwright probe is 1 VU vs N JMeter VUs, so it sees no contention.
2. Playwright tree rows have multiple types — learn the
hierarchy. The Playwright VU rows in the tree are labelled with a
{label, type} suffix on the actionId. Types you'll see:
| Type | What it measures |
|---|---|
| (bare id, no suffix) | Wall-clock per spec iteration (the source of truth for user-perceived journey duration) |
| `GROUP` (label=`Actions`) | Sum of all ACTION durations (overlaps with Network since Playwright is async) |
| `GROUP` (label=`Network`) | Cumulative time spent in HTTP requests per iteration |
| `HOOK` (label=`Before/After Hooks()`) | Playwright setup / teardown overhead |
| `ACTION` (label=`page.X(...)`) | Individual Playwright command duration |
| `EXPECT` (label=`expect.X(...)`) | Individual assertion duration |
| `NETWORK` (label=`<host>`) | Aggregate of every individual HTTP request the browser made (hits=total requests) |
| `NAVIGATION` | Whole-page nav timing (DOM ready, load) |
Don't sum types to compute total time — they overlap (Playwright
is async). The bare actionId row is the wall-clock. If a hybrid
scenario reports Network GROUP = 2.5s and Actions GROUP = 1.5s,
the real per-iteration time is not 4s — the bare row will show
~1.4s (the actual wall-clock).
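The overlap can be illustrated with the numbers from the example above. The row shape (`virtualUserId`, `type`, `label`, `avg_s`) is a simplified stand-in for a tree entry, not the real response format, and `pw-1` is a made-up id.

```python
# Simplified tree rows for one Playwright VU; values mirror the example above.
rows = [
    {"virtualUserId": "pw-1", "type": None, "label": None, "avg_s": 1.4},
    {"virtualUserId": "pw-1", "type": "GROUP", "label": "Network", "avg_s": 2.5},
    {"virtualUserId": "pw-1", "type": "GROUP", "label": "Actions", "avg_s": 1.5},
]


def iteration_wall_clock(rows, virtual_user_id):
    """Return the bare (untyped) row's duration: the real per-iteration time."""
    for row in rows:
        if row["virtualUserId"] == virtual_user_id and row["type"] is None:
            return row["avg_s"]
    return None


# Summing the GROUP rows double-counts overlapping async work.
naive_sum = sum(r["avg_s"] for r in rows if r["type"] == "GROUP")
wall_clock = iteration_wall_clock(rows, "pw-1")
```

Here `naive_sum` comes out at 4.0 s while `wall_clock` is 1.4 s, which is the gap the paragraph above warns about.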
Hits (CONTAINER) vs Hits in TopReportItem. When
get_report_top_values returns the container actionId at the top
(in our case e3331762-... for the JMeter VU's root container), the
value is the whole iteration's elapsed time — including thinktime
and all sub-actions. That's expected, not an anomaly. To find the
slowest real action, ignore the container and any .resources
aggregate.
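Skipping the container and the `.resources` aggregate can be sketched as below. The row shape (`actionId`, `name`, `value`) and the idea that the aggregate is recognisable by a name ending in `.resources` are assumptions; adapt the filter to what `get_report_top_values` actually returns.

```python
def slowest_real_action(top_rows, container_ids):
    """Pick the slowest row that is neither a container nor a .resources aggregate."""
    candidates = [
        r for r in top_rows
        if r["actionId"] not in container_ids
        and not r["name"].endswith(".resources")
    ]
    # default=None covers the case where every row was filtered out.
    return max(candidates, key=lambda r: r["value"], default=None)
```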
Log retention. JMeter / Playwright log files are erased 7 days after the run, or as soon as the user leaves the design screen. Zipped logs >200 MB are dropped entirely. Old runs may have no logs to read — confirm freshness before promising a re-read.
When the test stopped earlier than its planned duration (or finished with the
"no more users running" pattern), pull jmeter.log via
list_bench_result_files + read_bench_result_file_lines and grep for one of
these signatures to tell why it ended:
| Log signature | Meaning | What to tell the user |
|---|---|---|
| `Thread finished: <id>` repeated as threads drain before duration ends | End of iterations: VUs hit their max-iterations policy | Planned stop. Increase iterations or duration if more load was expected. |
| `End of file:resources/<csv>.csv detected for CSV DataSet:<name> ... stopThread:true` | CSV exhausted: End-of-File policy = Stop VU | Planned stop. Either grow the CSV file or change EOF policy to Recycle / Continue. |
| `Shutdown Test detected by thread: <id>` | On-Sample-Error policy fired: at least one sample failed and the policy is Stop VU / Stop test | Policy-driven stop. Check the failing samples to decide whether to relax the policy. |
| `<user>@octoperf.com aborted the test` then `Test status changed: RUNNING => ABORTED` | Manual abort by a user in the UI | Operator action. Nothing to fix on the VU side. |
| `WARN Aborting stall test (expected end time:<ts> is past now)` | Stall abort: batch killed an unresponsive run 20 min past its planned end | The test stopped responding to shutdown signals (long loops / scripts / timeouts). Inspect per-action metrics for the culprit. |
| `o.a.j.JMeter: Command: Shutdown received from /127.0.0.1` (and nothing else) | Load generator killed: container shutdown from outside JMeter | Infra event (on-premise agent stopped, container removed, OOM). Re-run on the cloud or a healthier agent. |
| `ERROR: java.lang.RuntimeException: Failed to perform cmdline operation: jmeter-plugins.org` | Plugin download failure (on-premise agent without internet) | Provide the plugin JAR via the project files menu, or disable plugin download in the on-premise infra settings. |
Read the tail of jmeter.log first (the last 100 lines usually carry the
shutdown signature), then scan from the top if the cause isn't terminal.
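The tail-first scan can be sketched with a small regex catalogue built from the signatures in the table above. The ordering of the list is a choice made for this sketch, with specific signatures first and the generic `Thread finished:` last because it can accompany any shutdown. `classify_stop` is a hypothetical name.

```python
import re

# (pattern, meaning) pairs taken from the signature table; order is a sketch choice.
SIGNATURES = [
    (re.compile(r"Aborting stall test"), "stall abort"),
    (re.compile(r"aborted the test"), "manual abort"),
    (re.compile(r"Shutdown Test detected by thread"), "on-sample-error policy"),
    (re.compile(r"End of file:.*detected for CSV DataSet"), "CSV exhausted"),
    (re.compile(r"Failed to perform cmdline operation: jmeter-plugins\.org"),
     "plugin download failure"),
    (re.compile(r"Command: Shutdown received from"), "load generator killed"),
    (re.compile(r"Thread finished:"), "end of iterations"),
]


def classify_stop(log_lines, tail=100):
    """Scan the last `tail` lines first, then the whole log, for a known signature."""
    for scope in (log_lines[-tail:], log_lines):
        for pattern, meaning in SIGNATURES:
            if any(pattern.search(line) for line in scope):
                return meaning
    return None
```

A `None` result means no catalogued signature was found, which is worth reporting as such and not guessing a cause.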
Per-sample network failure signatures are a different beast — they show
up in the Errors report item / get_report_errors (response code = -1 UNKNOWN) rather than as test-stopping events. When the LLM sees these in
the errorMessage field of a BenchError, the cause is not in the VU
and is not under OctoPerf's control:
| Java exception in `errorMessage` | Cause | What to tell the user |
|---|---|---|
| `javax.net.ssl.SSLException: Connection reset` | TCP connection torn down mid-handshake | Target rejected the connection. Most often DDoS protection / WAF rate-limiting the LG IPs. |
| `java.net.SocketTimeoutException: Read timed out` | Server accepted but didn't answer within timeout | Target is overloaded or the action's response timeout is too tight. Check server monitoring first. |
| `javax.net.ssl.SSLHandshakeException: Remote host terminated the handshake` | Server cut the TLS handshake | Almost always WAF / DDoS protection. Suggest dedicated IPs or whitelist the LG IP ranges. |
| `org.apache.http.NoHttpResponseException: <host> failed to respond` | TCP connection alive but server returned no HTTP response | Server-side request handling failed silently, often a saturated thread pool. |
| `org.apache.http.conn.HttpHostConnectException: ... Connection timed out` | LG couldn't open a TCP connection within timeout | Network path issue (firewall, routing, target IP unreachable from the LG region). |
| `java.net.NoRouteToHostException: No route to host` | LG cannot reach the target at all (no route in routing table) | Target requires IP whitelisting. Suggest the LG region's IP range or a dedicated IP. |
When two or more of these appear under load but not in a smoke run, the diagnosis is almost always server-side mitigation kicking in (rate-limit, WAF, DDoS-protection) rather than a real capacity issue.
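The "two or more under load but not in a smoke run" rule can be sketched as a set difference over exception classes. The inputs are assumed to be lists of `errorMessage` strings collected from each run; the function names are hypothetical.

```python
# Exception class names from the table above.
KNOWN_EXCEPTIONS = (
    "javax.net.ssl.SSLException",
    "java.net.SocketTimeoutException",
    "javax.net.ssl.SSLHandshakeException",
    "org.apache.http.NoHttpResponseException",
    "org.apache.http.conn.HttpHostConnectException",
    "java.net.NoRouteToHostException",
)


def exception_classes(messages):
    """Set of known exception classes mentioned in the given error messages."""
    return {k for k in KNOWN_EXCEPTIONS for m in messages if k in m}


def looks_like_mitigation(load_messages, smoke_messages):
    """True when at least two known classes appear under load but not in smoke."""
    load_only = exception_classes(load_messages) - exception_classes(smoke_messages)
    return len(load_only) >= 2
```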
- `run_scenario` is destructive (consumes credits) and rarely the right next move during diagnosis.
- If `get_bench_status` shows progress < 1.0, label the read as preliminary and offer to come back.
- `octoperf-validation-triage`: when the VU itself is failing.
- `octoperf-auto-correlation`: when failures are session/auth-state related.
npx claudepluginhub octoperf/octoperf-claude-plugins --plugin octoperf