From systems-design
Designs chaos engineering experiments for systems: identifies failure modes, defines steady states, creates hypotheses, and generates GameDay plans for resilience testing.
Slash command: /systems-design:chaos-plan
This command helps design chaos engineering experiments and GameDay plans for a system.
Generate comprehensive chaos engineering plans including:
First, understand the system:
If a system/service name is provided:
Analyze architecture:
System Analysis Checklist:
□ Service boundaries and responsibilities
□ External dependencies (databases, APIs, queues)
□ Internal service dependencies
□ Data flows and critical paths
□ Current resilience patterns in place
□ Existing monitoring and observability
Identify potential failure modes:
Infrastructure failures:
Application failures:
Dependency failures:
Data failures:
Define what "healthy" looks like:
Identify key metrics:
Steady State Metrics:
Request-based (RED):
- Request rate: [baseline] requests/sec
- Error rate: < [threshold]%
- Duration (p99): < [threshold]ms
Resource-based (USE):
- CPU utilization: < [threshold]%
- Memory utilization: < [threshold]%
- Queue depth: < [threshold]
Business metrics:
- [Metric 1]: [baseline/threshold]
- [Metric 2]: [baseline/threshold]
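The steady-state template above can be made checkable with a small script. The following is a minimal sketch, not part of the command itself: the `Threshold` class, the metric names, and every numeric bound are hypothetical placeholders standing in for the `[threshold]` values you would measure on your own system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Upper bound for one steady-state metric (hypothetical structure)."""
    name: str
    max_value: float


def steady_state_violations(thresholds, observed):
    """Return the names of metrics that are missing or not below their bound."""
    violations = []
    for t in thresholds:
        value = observed.get(t.name)
        # A missing metric counts as a violation: the system cannot be
        # called healthy for something that was not measured.
        if value is None or value >= t.max_value:
            violations.append(t.name)
    return violations


# Placeholder bounds (assumptions); replace with your measured baselines.
THRESHOLDS = [
    Threshold("error_rate_pct", 1.0),
    Threshold("latency_p99_ms", 500.0),
    Threshold("cpu_utilization_pct", 80.0),
    Threshold("queue_depth", 1000.0),
]

observed = {
    "error_rate_pct": 0.2,
    "latency_p99_ms": 620.0,
    "cpu_utilization_pct": 55.0,
    "queue_depth": 40.0,
}
print(steady_state_violations(THRESHOLDS, observed))  # ['latency_p99_ms']
```

Running the same check before, during, and after fault injection gives a consistent definition of "healthy" across the whole experiment.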
Design experiments for identified failure modes:
For each priority failure mode, create:
Experiment: [Name]
Hypothesis:
"When [fault condition] occurs, [system component] will
[expected behavior] because [reasoning]."
Fault Injection:
- Type: [Latency/Error/Termination/Partition/etc.]
- Target: [Service/instance/dependency]
- Magnitude: [Degree of fault]
- Duration: [How long]
Blast Radius:
- Affected components: [List]
- User impact estimate: [Percentage/description]
Abort Conditions:
- Error rate > [threshold]
- Latency p99 > [threshold]
- [Business metric] breached
- Customer complaints received
Rollback Steps:
1. [Step to revert fault]
2. [Step to verify recovery]
Success Criteria:
□ [Metric] remains within [bounds]
□ [Recovery] happens within [time]
□ [Alerts] fire as expected
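One way to keep an experiment definition and its abort conditions together is a small data structure. This is an illustrative sketch only: the `Experiment` and `AbortCondition` classes, the service name, and all thresholds are assumptions made for the example, not output of the command.

```python
from dataclasses import dataclass, field


@dataclass
class AbortCondition:
    metric: str
    threshold: float  # abort when the observed value exceeds this


@dataclass
class Experiment:
    name: str
    hypothesis: str
    fault_type: str
    target: str
    duration_s: int
    abort_conditions: list = field(default_factory=list)

    def breached(self, observed):
        """Return the metrics whose abort condition is currently breached."""
        return [
            c.metric
            for c in self.abort_conditions
            # A metric that cannot be observed is treated as breached,
            # so losing visibility stops the experiment.
            if observed.get(c.metric, float("inf")) > c.threshold
        ]


# Hypothetical experiment following the template above.
experiment = Experiment(
    name="payment-api-latency",
    hypothesis=(
        "When the payment API responds 300 ms slower, checkout will stay "
        "available because calls time out and fall back to a queued retry."
    ),
    fault_type="Latency",
    target="payment-api",
    duration_s=600,
    abort_conditions=[
        AbortCondition("error_rate_pct", 5.0),
        AbortCondition("latency_p99_ms", 1000.0),
    ],
)

print(experiment.breached({"error_rate_pct": 0.4, "latency_p99_ms": 1500.0}))
# ['latency_p99_ms']
```

Any non-empty result is the signal to run the rollback steps; human-judged conditions such as customer complaints still need a manual stop.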
Prioritize experiments by:
Risk-based prioritization:
| Factor | Weight |
|---|---|
| Likelihood of failure | High |
| Impact if it occurs | High |
| Current uncertainty | Medium |
| Ease of testing | Low |
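The weighting table can be turned into a simple ranking. The sketch below assumes a numeric mapping of High = 3, Medium = 2, Low = 1 and a 1 to 5 rating per factor; both the mapping and the candidate experiments are hypothetical and should be tuned to your own risk model.

```python
# Assumed numeric mapping of the table: High = 3, Medium = 2, Low = 1.
WEIGHTS = {
    "likelihood": 3,
    "impact": 3,
    "uncertainty": 2,
    "ease_of_testing": 1,
}


def priority_score(ratings):
    """Weighted sum of 1-5 ratings; a higher score means test it sooner."""
    return sum(WEIGHTS[factor] * ratings[factor] for factor in WEIGHTS)


# Hypothetical candidate experiments with 1-5 ratings per factor.
candidates = {
    "database-failover": {
        "likelihood": 3, "impact": 5, "uncertainty": 4, "ease_of_testing": 2,
    },
    "cache-node-loss": {
        "likelihood": 4, "impact": 2, "uncertainty": 2, "ease_of_testing": 5,
    },
    "dependency-latency": {
        "likelihood": 4, "impact": 4, "uncertainty": 3, "ease_of_testing": 3,
    },
}

ranked = sorted(
    candidates, key=lambda name: priority_score(candidates[name]), reverse=True
)
for name in ranked:
    print(name, priority_score(candidates[name]))
# database-failover 34
# dependency-latency 33
# cache-node-loss 27
```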
Recommended order:
If multiple experiments or a team exercise are desired:
GameDay Plan: [Title]
Date: [Proposed date]
Duration: [Hours]
Participants: [Teams/roles needed]
Objectives:
1. Validate [resilience pattern/assumption]
2. Practice [incident response/coordination]
3. Test [runbook/recovery procedure]
Pre-GameDay Checklist:
□ Stakeholder approval
□ Participant briefing scheduled
□ Monitoring dashboards verified
□ Kill switches tested
□ Rollback procedures documented
□ Communication channels set up
Schedule:
[Time] - Pre-brief and role assignment
[Time] - Baseline capture
[Time] - Scenario 1: [Name]
[Time] - Debrief / break
[Time] - Scenario 2: [Name]
[Time] - Hot debrief
[Time] - Cleanup and verification
Scenarios:
Scenario 1: [Name]
- Objective: [What we're testing]
- Hypothesis: [Expected behavior]
- Injection: [Fault details]
- Duration: [Time]
- Success criteria: [Metrics]
Scenario 2: [Name]
[Same structure]
Safety:
- Kill switch: [How to immediately stop]
- Rollback: [How to revert all changes]
- Communication: [Primary channel]
- Escalation: [Who to contact if real incident]
Roles:
- GameDay Lead: [Responsibilities]
- Scenario Executor: [Responsibilities]
- Observers: [Responsibilities]
- Scribe: [Responsibilities]
Post-GameDay:
- Hot debrief: Same day
- Formal postmortem: Within 1 week
- Action items tracked in: [System]
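The `[Time]` slots in the schedule can be filled in from a start time and a duration per item. This is a small sketch under assumed values: the start time, the per-item durations, and the `build_schedule` helper are all hypothetical.

```python
from datetime import datetime, timedelta


def build_schedule(start, items):
    """Turn (label, minutes) pairs into 'HH:MM - label' lines."""
    lines = []
    current = start
    for label, minutes in items:
        lines.append(f"{current:%H:%M} - {label}")
        current += timedelta(minutes=minutes)
    return lines


# Durations in minutes are placeholders; adjust to your GameDay.
agenda = [
    ("Pre-brief and role assignment", 30),
    ("Baseline capture", 15),
    ("Scenario 1: [Name]", 45),
    ("Debrief / break", 15),
    ("Scenario 2: [Name]", 45),
    ("Hot debrief", 30),
    ("Cleanup and verification", 30),
]

# Hypothetical start: 10:00 on an arbitrary date.
schedule = build_schedule(datetime(2025, 1, 15, 10, 0), agenda)
for line in schedule:
    print(line)
```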
Generate deliverables:
# Plan chaos for a specific service
/sd:chaos-plan order-service
# Plan with architecture context
/sd:chaos-plan @docs/architecture/payment-system.md
# Plan GameDay for entire system
/sd:chaos-plan "e-commerce platform" --gameday
Use AskUserQuestion to:
The command produces:
This command leverages:
chaos-engineering-fundamentals - Experiment design principles
resilience-patterns - Patterns to test and validate
gameday-planning - GameDay execution guidance
incident-response - Handling discovered issues
For ongoing chaos engineering consultation:
chaos-engineer - Resilience testing expertise
npx claudepluginhub melodic-software/claude-code-plugins --plugin systems-design