From bito-ai-architect
Diagnose production incidents by gathering cross-repo context, mapping blast radius, and building a structured remediation plan using AI Architect.
Slash command: `/bito-ai-architect:production-triage`
> ⚠️ **Requires:** BitoAIArchitect MCP server configured and running. Run `/setup-bito` first if not configured.
Systematically diagnose and triage a production issue by leveraging AI Architect to understand service topology, trace dependency chains, identify blast radius, and surface root cause candidates. This skill turns a reactive "grep logs and hope" approach into a structured diagnostic workflow grounded in actual system architecture.
This skill is architecturally different from the planning skills. Feature Plan, PRD, and TRD are generative workflows (build something new). Production Triage is a diagnostic workflow (understand something broken). The phases are oriented around narrowing down the problem, not building up a solution.
flowchart TB
start["START<br/>Receive Incident Report"]
phase1["Phase 1<br/>Capture Symptoms"]
phase2["Phase 2<br/>Map Service Topology<br/>(MANDATORY)"]
phase3["Phase 3<br/>Trace Dependency Chain<br/>& Blast Radius"]
checkpoint1["CHECKPOINT 1<br/>Present Topology & Blast Radius<br/>Confirm Scope"]
phase4["Phase 4<br/>Generate Hypotheses<br/>& Diagnostic Plan"]
checkpoint2["CHECKPOINT 2<br/>User Validates Hypotheses"]
phase5["Phase 5<br/>Root Cause Analysis<br/>& Remediation"]
done["DONE<br/>Deliver Triage Report"]
start --> phase1
phase1 --> phase2
phase2 --> phase3
phase3 --> checkpoint1
checkpoint1 --> phase4
phase4 --> checkpoint2
checkpoint2 --> phase5
phase5 --> done
The ONLY valid terminal state is DONE. You MUST pass through every phase and checkpoint in order. There are no shortcuts.
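The ordering rule above can be pictured as a small state machine. The following is a minimal sketch, not part of the skill itself: the `Step` enum and `TriageWorkflow` class are hypothetical names, and the step labels are copied from the flowchart.

```python
from enum import Enum


class Step(Enum):
    # Labels mirror the flowchart nodes; the enum itself is illustrative.
    START = "Receive Incident Report"
    PHASE_1 = "Capture Symptoms"
    PHASE_2 = "Map Service Topology"
    PHASE_3 = "Trace Dependency Chain & Blast Radius"
    CHECKPOINT_1 = "Present Topology & Blast Radius, Confirm Scope"
    PHASE_4 = "Generate Hypotheses & Diagnostic Plan"
    CHECKPOINT_2 = "User Validates Hypotheses"
    PHASE_5 = "Root Cause Analysis & Remediation"
    DONE = "Deliver Triage Report"


ORDER = list(Step)  # Enum iteration preserves definition order


class TriageWorkflow:
    def __init__(self) -> None:
        self.current = Step.START

    def advance(self, target: Step) -> None:
        """Move to `target` only if it is the immediate next step."""
        idx = ORDER.index(self.current)
        if idx + 1 >= len(ORDER):
            raise ValueError("workflow is already at DONE")
        expected = ORDER[idx + 1]
        if target is not expected:
            raise ValueError(
                f"cannot skip to {target.name}; next step is {expected.name}"
            )
        self.current = target

    @property
    def finished(self) -> bool:
        return self.current is Step.DONE
```

Calling `advance(Step.PHASE_4)` straight after Phase 1 raises a `ValueError`, which is the "no shortcuts" rule expressed as code.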
| Rationalization | Why It's Wrong |
|---|---|
| "The error message tells me the root cause" | Error messages show symptoms, not causes. A database timeout in Service A might be caused by a memory leak in Service B that shares the connection pool. |
| "I know which service is broken from the logs" | You know which service is failing. The root cause may be upstream, downstream, or in shared infrastructure. You need the dependency chain. |
| "This is a simple bug — no need for topology mapping" | Simple bugs in one service often have cascading effects across dependent services. You need blast radius even for "simple" issues. |
| "I'll just look at the service that's alerting" | The alerting service is the victim, not necessarily the cause. AI Architect's dependency mapping reveals the actual causal chain. |
| "Time is critical — skip the context gathering" | 10 minutes of structured diagnosis saves hours of random debugging. Skipping topology mapping leads to fixing symptoms, not causes. |
This skill applies to EVERY production issue regardless of perceived severity or simplicity.
Gather all available information about the incident:
Structure symptoms as:
### Symptom Summary
- **Impact**: [What users are experiencing]
- **Onset**: [When it started, sudden vs. gradual]
- **Scope**: [Who/what is affected]
- **Error Signals**: [Error messages, HTTP status codes, log entries]
- **Recent Changes**: [Deployments, config changes in the window]
If the user hasn't provided enough symptom data, ask clarifying questions before proceeding.
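One way to decide whether clarifying questions are still needed is to treat the symptom summary as a record and list its empty fields. This is a sketch only: `SymptomSummary` and `missing_fields` are hypothetical names, and the sample values are invented for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class SymptomSummary:
    # One attribute per bullet in the Symptom Summary template.
    impact: str = ""
    onset: str = ""
    scope: str = ""
    error_signals: str = ""
    recent_changes: str = ""


def missing_fields(summary: SymptomSummary) -> list[str]:
    """Return the names of fields that are empty or whitespace-only."""
    return [f.name for f in fields(summary) if not getattr(summary, f.name).strip()]


# Hypothetical incident data, for illustration only.
summary = SymptomSummary(impact="Checkout returns HTTP 503", onset="Sudden")
print(missing_fields(summary))  # ['scope', 'error_signals', 'recent_changes']
```

Each name returned corresponds to a question worth asking the user before moving on.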
Do NOT proceed to Phase 3 until you have mapped the service topology around the affected area using AT LEAST 4 AI Architect queries. Triage without topology mapping is INVALID.
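The gate above could be tracked with a simple query log. The sketch below only counts queries that were run; it does not call AI Architect. `TopologyQueryLog` is a hypothetical helper, the tool names are the ones this skill uses, and the target strings are made up.

```python
from dataclasses import dataclass, field

MIN_QUERIES = 4  # from the gate: "AT LEAST 4 AI Architect queries"


@dataclass
class TopologyQueryLog:
    queries: list[tuple[str, str]] = field(default_factory=list)

    def record(self, tool: str, target: str) -> None:
        """Note that `tool` was run against `target` (a repo name or search term)."""
        self.queries.append((tool, target))

    def may_enter_phase_3(self) -> bool:
        return len(self.queries) >= MIN_QUERIES


log = TopologyQueryLog()
log.record("searchRepositories", "checkout timeout")   # hypothetical search term
log.record("getRepositoryInfo", "checkout-service")    # hypothetical repo name
log.record("listClusters", "checkout-service")
print(log.may_enter_phase_3())  # False
```

A count is a floor, not proof of coverage: four queries against the same repository would pass this check without mapping upstream, downstream, or shared infrastructure.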
You MUST create a task checklist and complete each item:
1. Identify the Affected Service(s): Which service(s) are directly exhibiting the symptoms?
   - `searchRepositories` for service keywords from error messages/logs
   - `getRepositoryInfo` with full detail for the affected service
2. Map Upstream Dependencies: What services call the affected service? If the affected service is degraded, who is impacted?
   - `getRepositoryInfo` with `includeIncomingDependencies` for the affected service
   - `listClusters` to understand the service's cluster membership
3. Map Downstream Dependencies: What does the affected service depend on? If a dependency is the root cause, this is how you find it.
   - `getRepositoryInfo` with `includeOutgoingDependencies` for the affected service
4. Identify Shared Infrastructure: What databases, caches, message queues, or shared services do the involved services share? Shared resources are common root causes.
   - `searchRepositories` for database, cache, queue, and infrastructure repos
   - `getRepositoryInfo` for shared infrastructure components

Using the topology from Phase 2, build:
Trace the path from the symptom back to potential root causes:
[User-Facing Symptom]
→ [Service A: exhibiting errors]
→ [Service B: upstream caller, also affected]
→ [Service C: downstream dependency]
→ [Database D: shared resource, potential bottleneck]
→ [Service E: another downstream dependency]
Map every service and system affected by this incident:
### Blast Radius
**Directly Affected**:
- [Service A]: [symptom]
- [Service B]: [symptom]
**Indirectly Affected** (dependent on affected services):
- [Service C]: [potential impact]
- [Service D]: [potential impact]
**Shared Infrastructure at Risk**:
- [Database X]: [used by N services]
- [Cache Y]: [used by N services]
**Unaffected** (confirmed isolated):
- [Service E]: [why it's isolated]
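Once the Phase 2 dependency data is in hand, the blast radius categories can be derived with a plain graph traversal. This is a minimal sketch under stated assumptions: the `calls` mapping is a hand-built stand-in for whatever AI Architect returns, the function names are hypothetical, and the service names reuse the placeholders from the dependency chain (Service F and Service G are added for illustration).

```python
from collections import defaultdict, deque


def reachable(graph: dict[str, set[str]], start: str) -> set[str]:
    """Breadth-first search: every node reachable from `start`, excluding `start`."""
    seen: set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, set()):
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def blast_radius(calls: dict[str, set[str]], affected: str) -> dict[str, set[str]]:
    """`calls[x]` is the set of components that x depends on."""
    called_by: dict[str, set[str]] = defaultdict(set)
    for caller, deps in calls.items():
        for dep in deps:
            called_by[dep].add(caller)

    upstream = reachable(called_by, affected)  # callers that inherit the failure
    downstream = reachable(calls, affected)    # dependencies that may hold the root cause
    involved = upstream | downstream | {affected}
    # A dependency used by more than one involved component is a shared resource.
    shared = {n for n in downstream if len(called_by[n] & involved) > 1}
    everything = set(calls) | set(called_by)
    return {
        "upstream": upstream,
        "downstream": downstream,
        "shared": shared,
        "not_reached": everything - involved,
    }


calls = {
    "Service B": {"Service A"},
    "Service A": {"Service C", "Service E"},
    "Service C": {"Database D"},
    "Service E": {"Database D"},
    "Service F": {"Service G"},
}
result = blast_radius(calls, "Service A")
print(sorted(result["upstream"]))     # ['Service B']
print(sorted(result["shared"]))       # ['Database D']
print(sorted(result["not_reached"]))  # ['Service F', 'Service G']
```

Components in `not_reached` are only candidates for the "Unaffected" list. The traversal follows declared dependencies, so isolation still needs to be confirmed against what is actually observed.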
Present the dependency chain and blast radius to the user.
Ask the user: "Here's the service topology and blast radius I mapped. Does this match what you're seeing? Are there any services I'm missing or any additional symptoms?"
Do NOT proceed until confirmed.
Based on symptoms (Phase 1) and topology (Phases 2-3), generate ranked hypotheses:
### Hypothesis [N]: [One-line description]
**Likelihood**: High / Medium / Low
**Reasoning**: [Why this could be the cause, based on symptoms + topology]
**Evidence For**: [What symptoms support this]
**Evidence Against**: [What symptoms contradict this]
**Diagnostic Steps**:
1. [Specific check to confirm or rule out]
   - Where to look: [service, log, metric, dashboard]
   - What to look for: [specific pattern, value, or condition]
2. [Next check if step 1 is inconclusive]
   - ...
**If Confirmed — Immediate Mitigation**:
- [What to do RIGHT NOW to stop the bleeding]
Generate at least 3 hypotheses, ranked by likelihood. Each must have concrete diagnostic steps — not vague suggestions like "check the logs."
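The rules above (at least three hypotheses, ranked, each with a concrete diagnostic step) can be checked mechanically. The sketch below is illustrative only: `Hypothesis`, `rank_and_validate`, and the `VAGUE_STEPS` list are hypothetical, and a phrase list is a crude proxy for judging whether a step is concrete.

```python
from dataclasses import dataclass, field

RANK = {"High": 0, "Medium": 1, "Low": 2}
# Illustrative examples of steps too vague to count; not an exhaustive list.
VAGUE_STEPS = {"check the logs", "look at the logs", "investigate"}


@dataclass
class Hypothesis:
    description: str
    likelihood: str  # "High", "Medium", or "Low"
    diagnostic_steps: list[str] = field(default_factory=list)


def rank_and_validate(hypotheses: list[Hypothesis]) -> list[Hypothesis]:
    """Sort by likelihood and reject sets that break the rules above."""
    if len(hypotheses) < 3:
        raise ValueError("need at least 3 hypotheses")
    for h in hypotheses:
        if h.likelihood not in RANK:
            raise ValueError(f"unknown likelihood: {h.likelihood!r}")
        concrete = [
            s for s in h.diagnostic_steps
            if s.strip().lower().rstrip(".") not in VAGUE_STEPS
        ]
        if not concrete:
            raise ValueError(f"no concrete diagnostic step for: {h.description}")
    # sorted() is stable, so hypotheses with equal likelihood keep their input order.
    return sorted(hypotheses, key=lambda h: RANK[h.likelihood])
```

A hypothesis whose only step is "Check the logs." is rejected, while a step that names a specific service, metric, and threshold passes.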
Use AI Architect to inform hypotheses:
- `searchSymbols` for error handling code in affected services, to understand what error paths exist
- `getCode` for retry logic, circuit breakers, and timeout configs, to understand resilience behavior
- `searchSymbols` for recent migration files or config changes, to correlate with onset time

Present hypotheses and diagnostic plan. Ask: "Do these hypotheses make sense? Should I prioritize any differently? Do you have additional data that confirms or rules out any of these?"
Do NOT proceed until the user provides feedback.
Based on user feedback and diagnostic results, produce the final triage report.
# Production Triage Report: [Incident Title]
## 1. Incident Summary
- **Severity**: [P0-P4]
- **Impact**: [User-facing impact description]
- **Duration**: [Start time → resolution time or ongoing]
- **Affected Services**: [List]
- **Affected Users/Scope**: [Who was impacted]
## 2. Symptom Timeline
| Time | Event |
|---|---|
| [timestamp] | [First symptom observed] |
| [timestamp] | [Escalation / additional symptoms] |
| [timestamp] | [Diagnosis began] |
| [timestamp] | [Root cause identified] |
| [timestamp] | [Mitigation applied] |
## 3. Service Topology & Blast Radius
[From Phase 3 — dependency chain and blast radius map]
## 4. Root Cause Analysis
### Root Cause
[Clear, specific description of what went wrong and why]
### Causal Chain
[Trigger event] → [First effect] → [Cascading effect] → [User-facing symptom]
### Contributing Factors
- [Factor 1]: [How it contributed]
- [Factor 2]: [How it contributed]
### Why It Wasn't Caught Earlier
- [Gap in monitoring / testing / review that allowed this]
## 5. Hypotheses Evaluated
| Hypothesis | Verdict | Key Evidence |
|---|---|---|
| [Hypothesis 1] | ✅ Confirmed / ❌ Ruled Out | [What proved or disproved it] |
| [Hypothesis 2] | ✅ Confirmed / ❌ Ruled Out | [What proved or disproved it] |
| ... | ... | ... |
## 6. Immediate Mitigation Applied
- [What was done to stop the incident]
- [Temporary workarounds in place]
## 7. Permanent Fix Recommendations
### Short-Term (This Sprint)
| Fix | Repo | Description | Effort |
|---|---|---|---|
| ... | ... | ... | S/M/L |
### Medium-Term (Next 2-4 Weeks)
| Fix | Repo | Description | Effort |
|---|---|---|---|
| ... | ... | ... | S/M/L |
### Long-Term (Architecture Improvements)
| Fix | Repo(s) | Description | Effort |
|---|---|---|---|
| ... | ... | ... | S/M/L |
## 8. Prevention Measures
### Monitoring Gaps to Close
- [New alert / dashboard / metric to add]
### Testing Gaps to Close
- [New test scenarios to add]
### Process Gaps to Close
- [Deployment checks, review steps, runbook updates]
## 9. Cross-Repo Impact of Fixes
| Repo | Change Required | Risk | Notes |
|---|---|---|---|
| ... | ... | Low/Med/High | ... |
## 10. Open Questions
- [Unresolved items needing further investigation]
npx claudepluginhub agdas/bito-plugins-marketplace --plugin bito-ai-architect