From enterprise-software-playbook
Triage and diagnose production or local issues by following logs → traces → metrics (HTTP/gRPC/async). Use when investigating errors, latency spikes, 5xx responses, SLO violations, or regressions in an instrumented app. NOT for adding new instrumentation (use observability); NOT for applying resilience patterns (use resilience).
Slash command: `/enterprise-software-playbook:debug`
This skill is for **debugging** with existing telemetry. It does **not** focus on adding instrumentation (use `observability` when telemetry gaps block triage).
Goal: turn “something is broken/slow” into:
Inputs: Symptom description (what's failing/slow); access to logs, traces, and metrics; environment and time window.
Outputs: Root cause diagnosis with evidence, mitigation action, fix plan, learning capture. Feeds follow-ups to observability, resilience, platform, or architecture.
Capture:
- Environment (local/dev/staging/prod) and time window (start/end).
- `traceId`, `requestId`, `spanId`
- `op` (route template / RPC method / job name / message type)

Copy/paste helpers live in `references/commands.md`.
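The captured context can be held in a small record and used to narrow structured logs before reading them. This is a minimal sketch, assuming JSON-lines logs with `ts` (ISO 8601), `traceId`, and `op` fields. Those field names, the `TriageContext` type, and `filter_logs` are illustrative assumptions, not part of the skill.

```python
import json
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class TriageContext:
    """Hypothetical record of what to capture before digging in."""
    environment: str                # local / dev / staging / prod
    start: datetime                 # time window start
    end: datetime                   # time window end
    op: Optional[str] = None        # route template / RPC method / job name / message type
    trace_id: Optional[str] = None  # correlation ID, if one is known


def filter_logs(lines: Iterable[str], ctx: TriageContext) -> Iterator[dict]:
    """Yield parsed log records that match the captured context."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip unstructured lines instead of aborting triage
        if not isinstance(record, dict) or "ts" not in record:
            continue
        # Assumes timestamps and the window are both naive or both timezone-aware.
        ts = datetime.fromisoformat(record["ts"])
        if not (ctx.start <= ts <= ctx.end):
            continue
        if ctx.trace_id and record.get("traceId") != ctx.trace_id:
            continue
        if ctx.op and record.get("op") != ctx.op:
            continue
        yield record
```

Filtering by time window first and by `traceId` second keeps the result small even when no trace ID is available yet.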
If you have a traceId, use it.
- … `op`).
- “service A is timing out calling service B method X”
- “… Y is slow / missing index / deadlocked”
- “… T (poison message)”

If you cannot find or interpret traces, fall back to logs + metrics and consider adding missing telemetry via `observability`.
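One way to read a trace is to find the span where time is actually spent, rather than the span that is merely longest. The sketch below assumes spans are exported as dictionaries with `spanId`, `parentId`, `start_ms`, and `end_ms` keys. That shape and both function names are hypothetical and will differ per tracing backend.

```python
from collections import defaultdict
from typing import Dict, List


def self_times(spans: List[dict]) -> Dict[str, float]:
    """Map spanId -> self time in ms (own duration minus direct children's durations)."""
    child_total: Dict[str, float] = defaultdict(float)
    for span in spans:
        parent = span.get("parentId")
        if parent is not None:
            child_total[parent] += span["end_ms"] - span["start_ms"]
    return {
        span["spanId"]: max(
            0.0,  # clamp: parallel children can sum to more than the parent
            (span["end_ms"] - span["start_ms"]) - child_total[span["spanId"]],
        )
        for span in spans
    }


def slowest_span(spans: List[dict]) -> dict:
    """Return the span with the largest self time."""
    times = self_times(spans)
    return max(spans, key=lambda span: times[span["spanId"]])
```

Self time assumes children run sequentially. With parallel children it under-reports the parent, which is why the value is clamped at zero rather than trusted exactly.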
Use metrics to answer:
Start with RED for the boundary (HTTP route / gRPC method / consumer group).
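RED stands for Rate, Errors, and Duration. As a rough illustration, the sketch below derives all three for one boundary from raw request samples. The `(status, duration_ms)` input shape, the `Red` type, the choice of p95, and treating HTTP status 500 and above as an error are assumptions made for the example. In practice these numbers usually come from a metrics backend.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Red:
    rate_per_s: float   # Rate: requests per second over the window
    error_ratio: float  # Errors: fraction of requests that failed
    p95_ms: float       # Duration: 95th percentile latency (nearest rank)


def red_summary(requests: Sequence[Tuple[int, float]], window_s: float) -> Red:
    """Summarize (status, duration_ms) samples observed during window_s seconds."""
    if not requests or window_s <= 0:
        raise ValueError("need at least one request and a positive window")
    durations = sorted(duration for _, duration in requests)
    errors = sum(1 for status, _ in requests if status >= 500)
    # Nearest-rank percentile with integer math: rank = ceil(0.95 * n).
    rank = (95 * len(durations) + 99) // 100
    return Red(
        rate_per_s=len(requests) / window_s,
        error_ratio=errors / len(requests),
        p95_ms=durations[rank - 1],
    )
```

Comparing the same summary for the incident window and a known-good window makes it easier to tell a traffic change from a latency or error change.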
GATE: Failure propagation (step 4) must be mapped before deciding mitigate vs investigate. If you don't know what breaks next, you can't assess the urgency of mitigation.
If impact is high and evidence points to a recent change:
If impact is moderate or unclear:
If you found a systemic gap, capture it:
- `observability`
- `resilience`
- `platform`
- `architecture`

When context or time is constrained, these are the load-bearing steps:
Steps that can be cut under pressure: detailed metric blast-radius analysis (step 3 depth), organizational cascade mapping (step 4 breadth).
- `references/commands.md`
- `references/scenarios.md`
- `../references/structured-thinking-templates.md`
- `observability`

When using this skill, return:
… `../references/structured-thinking-templates.md`: Retrospective / Postmortem).

`npx claudepluginhub bricerising/enterprise-software-playbook --plugin enterprise-software-playbook`