From ai-infra-auto-driven-skills
Replay-first debug flow for SGLang serving problems with health-check failures, latency regressions, queue growth, timeouts, crash dumps, or PD/EP/HiCache issues. Collects baseline bundles before profiling.
Slash command
/ai-infra-auto-driven-skills:sglang-prod-incident-triage
Use this skill to turn a live serving problem into a debug path you can replay.
Use one loop:
Do not start with profiling.
This skill should work with more focused skills instead of re-implementing them:
- debug-cuda-crash when replay plus coredump points to a CUDA crash path
- debug-distributed-hang when the problem is clearly a TP/PP/DP/EP hang
- llm-torch-profiler-analysis when the issue is already narrowed to a compute-side path
Three examples are included:
Return:
- /health or /health_generate is unhealthy
If a live server is reachable, collect a read-only bundle before anything more intrusive:
python3 scripts/incident_artifact_tool.py collect-bundle \
--base-url http://127.0.0.1:30000 \
--outdir /tmp/incident_bundle
python3 scripts/incident_artifact_tool.py summarize-bundle \
/tmp/incident_bundle
If the server is protected:
python3 scripts/incident_artifact_tool.py collect-bundle \
--base-url http://127.0.0.1:30000 \
--token "$SGLANG_BEARER_TOKEN" \
--outdir /tmp/incident_bundle
The bundle script collects:
- /health
- /health_generate
- /model_info
- /server_info
- /v1/loads?include=all
- /v1/loads?include=core,queues,disagg,spec
- /metrics
- /hicache/storage-backend on a best-effort basis
Use the summary for a quick read on:
If the summary says the bundle was captured while the server was idle, recollect it during traffic or move quickly to dump plus replay.
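The read-only collection described above can be sketched as follows. This is a minimal illustration, not the implementation of incident_artifact_tool.py: the function names, the timeout, and the use of an "Authorization: Bearer" header for the token are assumptions. Only the endpoint paths come from the list above.

```python
import urllib.error
import urllib.request

# Endpoint paths as listed above; everything else in this sketch is illustrative.
BUNDLE_PATHS = [
    "/health",
    "/health_generate",
    "/model_info",
    "/server_info",
    "/v1/loads?include=all",
    "/v1/loads?include=core,queues,disagg,spec",
    "/metrics",
    "/hicache/storage-backend",
]


def build_requests(base_url, token=None):
    """Return one GET request per bundle endpoint.

    Assumption: a protected server accepts the token as a bearer header.
    """
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    base = base_url.rstrip("/")
    return [
        urllib.request.Request(base + path, headers=headers, method="GET")
        for path in BUNDLE_PATHS
    ]


def collect(base_url, token=None, timeout=5.0):
    """Fetch every endpoint on a best-effort basis.

    A failing endpoint is recorded as an error entry instead of aborting
    the whole bundle, so a partially unhealthy server still yields data.
    """
    bundle = {}
    for req in build_requests(base_url, token):
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                body = resp.read().decode("utf-8", "replace")
                bundle[req.full_url] = {"status": resp.status, "body": body}
        except OSError as exc:  # URLError, HTTPError, and timeouts are OSError subclasses
            bundle[req.full_url] = {"error": repr(exc)}
    return bundle
```

Only GET requests are issued, which keeps the collection read-only. Calling collect("http://127.0.0.1:30000") returns a dict keyed by URL that can be written to the output directory.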
If no live server is reachable, start from the best dump or log already available:
Read references/decision-tree.md only if the problem class is still unclear:
Then preserve the request payload that actually triggers the problem:
- --crash-dump-folder
Do not jump straight from a live symptom to low-level debugging without first saving something you can replay.
Read references/endpoints-and-signals.md when you need help reading the baseline bundle or the replay target.
Read references/replay-trace-profile.md when you need the replay, trace, profile, or bisect paths.
Standard order:
Use replay when:
If a crash dump exists, summarize it first:
python3 scripts/incident_artifact_tool.py summarize-dump \
--input-file /path/to/crash_dump.pkl
Then replay:
python3 /path/to/sglang/scripts/playground/replay_request_dump.py \
--input-file /path/to/crash_dump.pkl \
--host 127.0.0.1 \
--port 30000 \
--parallel 128
If safe_pickle_load blocks a locally captured trusted dump, use:
python3 scripts/replay_trusted_request_dump.py \
--input-file /path/to/request_dump.pkl \
--host 127.0.0.1 \
--port 30000 \
--parallel 1
If replay indicates a CUDA crash path, restart the same build with coredumps enabled before reproducing again:
SGLANG_CUDA_COREDUMP=1 \
SGLANG_CUDA_COREDUMP_DIR=/tmp/sglang_cuda_coredumps \
python -m sglang.launch_server \
--model-path ... \
--crash-dump-folder /tmp/sglang_crash_dump \
...
Then inspect the generated coredump:
cuda-gdb "$(which python3)" \
-ex "target cudacore /tmp/sglang_cuda_coredumps/cuda_coredump_<host>.<pid>.<ts>"
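When several reproductions have left more than one coredump in the directory, a small helper can pick the file to hand to cuda-gdb. This is a sketch under one assumption: the files keep the cuda_coredump_ prefix shown above. The helper name is hypothetical.

```python
import glob
import os


def newest_coredump(coredump_dir):
    """Return the most recently modified cuda_coredump_* file, or None if there is none."""
    candidates = glob.glob(os.path.join(coredump_dir, "cuda_coredump_*"))
    if not candidates:
        return None
    # Modification time identifies the latest reproduction.
    return max(candidates, key=os.path.getmtime)
```

For example, newest_coredump("/tmp/sglang_cuda_coredumps") returns the path to pass after "target cudacore".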
For a replay-first crash example, read references/case-studies.md.
Use tracing when:
If tracing was enabled at startup, you can change the level without restart:
curl "http://127.0.0.1:30000/set_trace_level?level=1"
curl "http://127.0.0.1:30000/set_trace_level?level=2"
Use profiling when:
At that point, switch to llm-torch-profiler-analysis. Do not duplicate
its profiling workflow here.
For a low-noise latency example, read references/case-studies.md.
If this looks like a collective stall, save the failing request, replay it on a
clean target, collect the replay-time bundle and stacks, then switch to
debug-distributed-hang.
For an example of that flow, read references/case-studies.md.
If one commit is known-good and another is known-bad, build a deterministic harness before doing deeper manual debugging:
- the harness exits 0 on good behavior and non-zero on bad behavior
- git bisect start <bad> <good>
- git bisect run <harness>
Prefer replay-backed bisect when the regression depends on request shape or long-running serving state.
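A harness for the bisect steps above might look like the sketch below. The function names are hypothetical, and treating a timeout as the regression is an assumption. The exit-code meanings (0 good, 1 bad, 125 skip) are those documented for git bisect run.

```python
import subprocess

GOOD, BAD, SKIP = 0, 1, 125  # exit codes understood by `git bisect run`


def classify(returncode, timed_out):
    """Map one replay attempt to a bisect verdict."""
    if timed_out:
        return BAD  # assumption: a hang or timeout counts as the regression
    return GOOD if returncode == 0 else BAD


def run_harness(cmd, timeout_s):
    """Run the replay command once and return a bisect exit code."""
    try:
        proc = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return classify(None, timed_out=True)
    except FileNotFoundError:
        return SKIP  # the command is missing, so this commit cannot be judged
    return classify(proc.returncode, timed_out=False)


def main(argv):
    """argv is [timeout_seconds, replay_command, args...]."""
    return run_harness(argv[1:], float(argv[0]))

# To use this as a script, end the file with:
#   import sys; sys.exit(main(sys.argv[1:]))
```

Every failure is collapsed to exit code 1 rather than passing the child's code through, because git bisect run aborts on codes above 127 and treats 125 as a skip.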
Switch tools once the fault class is clear:
- llm-torch-profiler-analysis for kernel and overlap attribution
- debug-distributed-hang for collective or rank-divergence hangs
- debug-cuda-crash for CUDA crash reproduction and kernel API logging
Do not switch tools before collecting the first bundle unless the user already has decisive logs or dumps.
Load only what the current step needs:
- safe_pickle_load blocks stock replay
If a live bundle was collected, include its path.
If replay, trace, or profiling was chosen, say why bundle plus dump were not enough.
npx claudepluginhub bbuf/ai-infra-auto-driven-skills --plugin ai-infra-auto-driven-skills