From amazon-doc-writer
Use this skill when authoring an Amazon Correction of Errors (COE) document — the post-incident write-up that captures what happened, why, and what will change. Provides the full COE template including incident summary, customer impact, timeline, 5 Whys, and action items. Load alongside the `amazon-writing-style` skill.
Slash command: `/amazon-doc-writer:writing-coe`
A COE is Amazon's post-incident document. It is a blameless, mechanism-focused narrative that explains: what broke, how customers were impacted, what happened minute-by-minute, the underlying causes (via 5 Whys), and the specific mechanisms that will prevent recurrence.
A COE is read by senior engineers and leadership. The bar is high: vague causes, missing action items, or owner-less commitments will be rejected.
If the user wants a generic post-incident analysis without the 5 Whys /
action-item discipline, consider `writing-analysis-report` instead.
# COE: <Short Incident Name>
**Author(s):** <names>
**Date of COE:** <YYYY-MM-DD>
**Date of incident:** <YYYY-MM-DD>
**Severity:** SEV-1 | SEV-2 | SEV-3
**Status:** Draft | In Review | Approved
**Service(s) affected:** <…>
## 1. Incident Summary
<2–4 sentences. What happened, when, for how long, and what customer
experience broke. A reader who stops here should know the headline.>
## 2. Customer Impact
<Quantified. Use the format:
- **Affected customers:** <count, % of total, geography>
- **Duration of impact:** <start UTC → end UTC, total minutes>
- **What customers experienced:** <prose — error messages, failed operations,
data loss, degraded performance, etc.>
- **Money / SLA impact:** <credits issued, contracts breached, $ estimate>
If a number is unknown, say "unknown" and add it to Open Questions — do not
omit the line.>
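The "unknown" rule above can be sketched as a small helper. This is only an illustration, not part of the skill: the function name `render_impact`, the field values, and the wording of the open-question entries are all hypothetical. It shows one way to keep every impact line present while routing missing numbers to Open Questions.

```python
from typing import Optional

# Hypothetical impact data; the figures are made up for illustration.
FIELDS: dict[str, Optional[str]] = {
    "Affected customers": "1,200 (3% of total, EU only)",
    "Duration of impact": "09:02 UTC to 10:30 UTC, 88 minutes",
    "Money / SLA impact": None,  # not yet known
}


def render_impact(fields: dict[str, Optional[str]]) -> tuple[list[str], list[str]]:
    """Return (impact lines, open questions); unknown values are never omitted."""
    lines: list[str] = []
    open_questions: list[str] = []
    for label, value in fields.items():
        if value is None:
            # Keep the line, say "unknown", and carry the gap forward.
            lines.append(f"- **{label}:** unknown")
            open_questions.append(f"- {label}: value not yet known")
        else:
            lines.append(f"- **{label}:** {value}")
    return lines, open_questions


IMPACT_LINES, OPEN_QUESTIONS = render_impact(FIELDS)
```

Each open-question entry still needs an owner added by hand, as the Open Questions section requires.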
## 3. Incident Timeline (all times UTC)
| Time (UTC) | Event |
| --- | --- |
| YYYY-MM-DD HH:MM | <trigger / change / first signal> |
| HH:MM | <detection — who/what detected it> |
| HH:MM | <escalation> |
| HH:MM | <mitigation attempted> |
| HH:MM | <mitigation effective> |
| HH:MM | <full recovery / all-clear> |
<Include time-to-detect, time-to-engage, time-to-mitigate, and time-to-recover
metrics derived from the timeline.>
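As a rough sketch of deriving those metrics, the snippet below computes minute deltas between timeline rows. The timestamps are invented, and the baselines are assumptions: here time-to-engage is measured from detection and the other three from the trigger. Teams define these differently, so state the definition used in the COE.

```python
from datetime import datetime, timezone

# Hypothetical timeline; every timestamp is made up for illustration.
TIMELINE = {
    "trigger": "2024-01-15 09:02",
    "detection": "2024-01-15 09:14",
    "escalation": "2024-01-15 09:20",
    "mitigation_effective": "2024-01-15 09:47",
    "recovery": "2024-01-15 10:30",
}


def parse_utc(stamp: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM' string and mark it as UTC."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed from start to end."""
    delta = parse_utc(end) - parse_utc(start)
    return int(delta.total_seconds() // 60)


def timeline_metrics(timeline: dict[str, str]) -> dict[str, int]:
    # Assumed baselines: engage is measured from detection, the rest from trigger.
    return {
        "time_to_detect": minutes_between(timeline["trigger"], timeline["detection"]),
        "time_to_engage": minutes_between(timeline["detection"], timeline["escalation"]),
        "time_to_mitigate": minutes_between(timeline["trigger"], timeline["mitigation_effective"]),
        "time_to_recover": minutes_between(timeline["trigger"], timeline["recovery"]),
    }


METRICS = timeline_metrics(TIMELINE)
```

Computing the deltas from the same rows that appear in the table keeps the metrics and the timeline from drifting apart during review.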
## 4. What Happened
<Narrative explanation of the failure mode. Walk through the system from
trigger → propagation → customer impact. Reference the timeline. Include or
link diagrams of the affected flow. Be specific: name the components, the
configurations, the limits hit.>
## 5. 5 Whys (Root Cause Analysis)
Start from the customer-visible failure and ask "why?" at least five times.
Branch where multiple causes contribute.
1. **Why did customers see <symptom>?** — Because <…>
2. **Why did <that> happen?** — Because <…>
3. **Why?** — Because <…>
4. **Why?** — Because <…>
5. **Why?** — Because <root cause, typically a missing mechanism>
<Where there are multiple contributing root causes (e.g. a latent bug + a
deployment process gap + a missing alarm), run a separate 5-Whys branch for
each. Each branch must terminate in a missing or broken mechanism, not in a
human error.>
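The branching rule above can be expressed as a simple structural check. The sketch below is illustrative only: `WhyBranch`, `check_branch`, and the example content are hypothetical, and whether a root cause counts as a mechanism remains a human judgement recorded in a flag.

```python
from dataclasses import dataclass


@dataclass
class WhyBranch:
    symptom: str
    answers: list[str]       # one "Because ..." answer per why, in order
    root_is_mechanism: bool  # reviewer's judgement about the final answer


def check_branch(branch: WhyBranch) -> list[str]:
    """Return the problems found in one 5-Whys branch (empty list if none)."""
    problems: list[str] = []
    if len(branch.answers) < 5:
        problems.append(f"only {len(branch.answers)} whys answered; need at least 5")
    if not branch.root_is_mechanism:
        problems.append("root cause is not a missing or broken mechanism")
    return problems


# Made-up branches for illustration.
COMPLETE = WhyBranch(
    symptom="checkout requests failed",
    answers=["a", "b", "c", "d", "no pre-deploy limit check exists"],
    root_is_mechanism=True,
)
SHALLOW = WhyBranch(
    symptom="alarm fired late",
    answers=["threshold too high", "an engineer mistyped it"],
    root_is_mechanism=False,
)

COMPLETE_PROBLEMS = check_branch(COMPLETE)
SHALLOW_PROBLEMS = check_branch(SHALLOW)
```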
## 6. What Went Well
<Specific things that limited the blast radius or sped recovery —
e.g. "throttling at the edge contained impact to one region",
"runbook for X was up to date and used". Keep honest and short.>
## 7. What Went Wrong
<Specific gaps, beyond the root cause itself — detection lag, missing
runbook, unclear ownership, alarm fatigue, etc.>
## 8. Action Items
| # | Action item | Type | Owner | Due date | Tracking link |
| - | --- | --- | --- | --- | --- |
| 1 | <specific, verifiable change> | Prevent / Detect / Mitigate / Process | <name or team> | YYYY-MM-DD | <ticket> |
<Rules:
- Every action item must be specific and verifiable ("Add alarm on X with
threshold Y, page sev-2"), not aspirational ("improve monitoring").
- Every item has an owner (person or single team) and a due date.
- Tag each as Prevent (stops recurrence), Detect (catches it sooner),
Mitigate (reduces impact), or Process (changes how we work).
- Prefer Prevent + Detect over Mitigate + Process.
- "Add training" / "be more careful" are not valid action items.>
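Some of the rules above are mechanical enough to lint before review. The following is a minimal sketch, assuming action items are held as simple records; `ActionItem`, `lint_action_item`, and the phrase list are hypothetical, and the check cannot judge whether an item is truly specific and verifiable.

```python
from dataclasses import dataclass
from datetime import datetime

VALID_TYPES = {"Prevent", "Detect", "Mitigate", "Process"}
# Phrases the rules name as aspirational or invalid; an assumed, partial list.
VAGUE_PHRASES = ("improve monitoring", "add training", "be more careful")


@dataclass
class ActionItem:
    description: str
    type: str
    owner: str
    due_date: str
    tracking_link: str


def lint_action_item(item: ActionItem) -> list[str]:
    """Return rule violations for one action item (empty list if none found)."""
    problems: list[str] = []
    lowered = item.description.lower()
    for phrase in VAGUE_PHRASES:
        if phrase in lowered:
            problems.append(f"vague wording: '{phrase}'")
    if item.type not in VALID_TYPES:
        problems.append(f"unknown type: '{item.type}'")
    if not item.owner.strip():
        problems.append("missing owner")
    try:
        # Checks that the value parses as a year-month-day calendar date.
        datetime.strptime(item.due_date, "%Y-%m-%d")
    except ValueError:
        problems.append("due date is not YYYY-MM-DD")
    return problems


# Made-up items for illustration.
GOOD = ActionItem("Add alarm on X with threshold Y, page sev-2",
                  "Detect", "team-a", "2024-02-01", "TICKET-1")
BAD = ActionItem("Improve monitoring", "Fix", "", "soon", "")

GOOD_PROBLEMS = lint_action_item(GOOD)
BAD_PROBLEMS = lint_action_item(BAD)
```

A check like this only catches the obvious failures; reviewers still decide whether each item would actually prevent or detect a recurrence.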
## 9. Lessons Learned
<2–4 short paragraphs. What does the org now understand that it didn't
before? What patterns elsewhere in our systems share this failure mode and
should be audited?>
## 10. Open Questions
- <unresolved facts and who owns resolving them>
## Appendix
- A. Full timeline with log excerpts
- B. Graphs (latency, error rate, traffic) for the incident window
- C. Related tickets, alarms, and prior COEs with similar root causes
## Sources
- `<relative path>` — <what it provided>
- `amazon-writing-style` self-review checklist passes.

Plugin command: `npx claudepluginhub louleowk/awesome-plugins --plugin amazon-doc-writer`