From godmode
Guides chaos engineering workflows: define steady state, map failure domains, design experiments for network failures, timeouts, DNS issues using tc and toxiproxy. For resilience tests and game days.
Slash command
/godmode:chaos
- User invokes `/godmode:chaos`
/godmode:chaos
Before injecting failures, establish what "healthy" looks like:
STEADY STATE DEFINITION:
System: <service name / system boundary>
Architecture: <monolith | microservices | serverless>
Health indicators (must all be true for "steady state"):
- Response success rate: > <X>% (e.g., 99.9%)
- Response time P95: < <X>ms (e.g., 500ms)
- Error rate: < <X>% (e.g., 0.1%)
- Queue depth: < <N> messages (e.g., 1000)
- CPU usage: < <X>% (e.g., 80%)
- Memory usage: < <X>% (e.g., 85%)
- Active connections: < <N> (e.g., connection pool max)
...
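The health indicators above can be turned into an automated gate. The following is a minimal sketch, assuming the three example thresholds from the template (99.9% success, 500ms P95, 0.1% errors); the function names `lt`, `gt`, and `check_steady_state` are hypothetical, and how the measured values are obtained is left to whatever monitoring the system already has.

```shell
# Hypothetical helper: succeeds only when every measured value is inside its threshold.
# Thresholds mirror the example values in the template; adjust them per system.

# lt A B succeeds when A < B; gt A B succeeds when A > B.
# awk is used because shell arithmetic does not handle decimals.
lt() { awk -v a="$1" -v b="$2" 'BEGIN { exit !(a < b) }'; }
gt() { awk -v a="$1" -v b="$2" 'BEGIN { exit !(a > b) }'; }

# Usage: check_steady_state <success_rate_pct> <p95_ms> <error_rate_pct>
check_steady_state() {
  local success_rate="$1" p95_ms="$2" error_rate="$3"
  gt "$success_rate" 99.9 || { echo "VIOLATED: success rate ${success_rate}%"; return 1; }
  lt "$p95_ms" 500        || { echo "VIOLATED: P95 ${p95_ms}ms"; return 1; }
  lt "$error_rate" 0.1    || { echo "VIOLATED: error rate ${error_rate}%"; return 1; }
  echo "STEADY"
}
```

Running the check before and after each injection gives a simple signal for "WHEN steady-state violated: record finding".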
Map all the ways the system can fail:
FAILURE DOMAIN MAP:
| Category | Components | Impact if Failed |
|--|--|--|
| Network | Load balancer | Total outage |
| | DNS resolution | Total outage |
| | Inter-service network | Partial outage |
| | External API access | Feature degraded |
| Compute | Application process | Service restart |
| | Worker processes | Queue backlog |
| | Cron/scheduled jobs | Delayed tasks |
| | Container/VM host | Service relocation |
| Storage | Primary database | Read/write loss |
IF experiment crashes service: halt and rollback. WHEN steady-state violated: record finding.
Create specific, controlled experiments for each failure domain:
CHAOS EXPERIMENT:
Name: <descriptive name>
Hypothesis: "When <failure condition>, the system will <expected behavior>"
Blast radius: <single request | single user | single service | entire system>
Duration: <how long to inject failure>
Rollback: <how to stop the experiment immediately>
Prerequisites:
- [ ] Steady state verified
- [ ] Monitoring dashboards open
- [ ] Rollback procedure tested
- [ ] Team notified (if production)
- [ ] Incident response team on standby (if production)
...
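The Duration and Rollback fields above suggest a wrapper that always attempts the rollback once the injection has started. This is a sketch under stated assumptions: `run_experiment` is a hypothetical name, and the inject and rollback commands are passed as strings that the caller has already tested.

```shell
# Usage: run_experiment <duration_seconds> <inject_command> <rollback_command>
# Runs the injection, holds it for the duration, then rolls back.
run_experiment() {
  local duration="$1" inject="$2" rollback="$3"
  (
    # The EXIT trap runs the rollback whenever this subshell ends:
    # after the duration elapses, or when the injection itself fails.
    trap "$rollback" EXIT
    # Turn an interrupt or termination into a normal exit so the EXIT trap still fires.
    trap 'exit 130' INT TERM
    eval "$inject" || exit 1
    sleep "$duration"
  )
}
```

With the Experiment N1 commands this might be called as `run_experiment 300 "tc qdisc add dev eth0 root netem delay 5000ms" "tc qdisc del dev eth0 root netem"`. Keeping the rollback in a trap rather than as a final line means an aborted run does not leave the failure in place.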
Experiment N1: Dependency Timeout
Hypothesis: "When the payment API responds slowly (5s+), the checkout
service returns a user-friendly error within 3 seconds and does not
block other requests."
Injection:
# Using tc (traffic control) to add latency
tc qdisc add dev eth0 root netem delay 5000ms
# Or using toxiproxy
toxiproxy-cli toxic add -n latency -t latency \
-a latency=5000 payment-api
...
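The "within 3 seconds" part of the hypothesis can be checked with a small timing assertion. This is a rough sketch: `assert_within` is a hypothetical helper, it measures whole seconds only, and it ignores the command's exit status because an error response is the expected outcome here.

```shell
# Usage: assert_within <max_seconds> <command> [args...]
# Succeeds when the command finishes within the budget, whatever its exit status.
assert_within() {
  local budget="$1"; shift
  local start end elapsed
  start=$(date +%s)
  "$@" >/dev/null 2>&1 || true   # only the duration is checked, not the status
  end=$(date +%s)
  elapsed=$(( end - start ))
  if [ "$elapsed" -gt "$budget" ]; then
    echo "SLOW: ${elapsed}s > ${budget}s"
    return 1
  fi
  echo "OK: ${elapsed}s <= ${budget}s"
}

# Example call (the URL is a placeholder, not taken from this guide):
#   assert_within 3 curl -s http://localhost:8080/checkout
```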
Experiment N2: DNS Failure
Hypothesis: "System falls back to cached data when DNS fails." Injection: iptables -A OUTPUT -p udp --dport 53 -j DROP (add the same rule with -p tcp, since DNS also uses TCP port 53). Verify cached responses are served and error messages are shown for uncached lookups.
Experiment N3: Packet Loss
Hypothesis: "With 10% packet loss, success rate stays >95%." Injection: tc qdisc add dev eth0 root netem loss 10%. Verify retry logic, SLO compliance, no pool exhaustion.
Experiment P1: Process Crash
Hypothesis: "When the application process crashes, it restarts within
30 seconds and no requests are dropped (load balancer removes unhealthy
instance)."
Injection:
# Kill application process
kill -9 $(pgrep -f "node server.js")
# Or in Kubernetes
kubectl delete pod <pod-name> --grace-period=0 --force
Verify:
...
Experiment P2: Memory Pressure
Hypothesis: "At 90%+ memory, app sheds load gracefully." Injection: stress-ng --vm 1 --vm-bytes 80% --timeout 300s. Verify load shedding, no OOM kill, health check alive.
Experiment P3: CPU Saturation
Hypothesis: "At 95% CPU, health checks and critical paths prioritized." Injection: stress-ng --cpu $(nproc) --timeout 300s. Verify health check <1s, background deferred, autoscaling triggers.
Experiment S1: Database Failover
Hypothesis: "When the primary database fails, the system fails over to
the replica within 30 seconds with < 1 second of write unavailability."
Injection:
# Stop primary database
docker stop postgres-primary
# Or in cloud — promote replica
aws rds failover-db-cluster --db-cluster-identifier <cluster>
Verify:
- Read traffic continues on replica immediately
...
Experiment S2: Cache Failure (Cold Cache)
Hypothesis: "When Redis is unavailable, the system falls back to direct
database queries with acceptable performance degradation (P95 < 2s
instead of < 200ms)."
Injection:
# Flush all cached data
redis-cli FLUSHALL
# Or kill Redis entirely
docker stop redis
Verify:
...
Experiment S3: Disk Full
Hypothesis: "When disk reaches 95%, the system stops non-critical writes,
alerts operators, and continues serving read traffic."
Injection:
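# Consume 90% of the remaining free space (this reaches 95% used only if the disk starts at about 50%; adjust the factor otherwise)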
fallocate -l $(df --output=avail / | tail -1 | awk '{print int($1*0.90)}')k /tmp/fill-disk
Verify:
- Log rotation and temp file cleanup triggered
- Non-critical writes (analytics, logs) paused
- Critical writes (transactions) continue to reserved space
- Alert fires with disk usage percentage
...
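Because the injection above scales with free space rather than total size, the resulting usage depends on how full the disk already is. A sketch of the arithmetic for hitting a chosen target follows; `fill_kb_for_target` is a hypothetical helper, and the result is approximate because `df` reports usage relative to non-reserved space.

```shell
# Usage: fill_kb_for_target <total_kb> <used_kb> <target_pct>
# Prints how many KiB to allocate so that used space reaches target_pct of the total.
fill_kb_for_target() {
  local total_kb="$1" used_kb="$2" target_pct="$3"
  local need_kb=$(( total_kb * target_pct / 100 - used_kb ))
  # A disk already past the target needs no extra fill.
  [ "$need_kb" -gt 0 ] || need_kb=0
  echo "$need_kb"
}

# Example wiring (not executed here; df --output is GNU coreutils):
#   set -- $(df --output=size,used / | tail -1)
#   fallocate -l "$(fill_kb_for_target "$1" "$2" 95)KiB" /tmp/fill-disk
```

Remove `/tmp/fill-disk` to roll back.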
Specifically test circuit breaker behavior:
CIRCUIT BREAKER VALIDATION:
State Transitions
CLOSED ──(failures > threshold)──→ OPEN
  ▲                                  │
  │                              (timeout)
  │                                  ▼
  └──(success)── HALF-OPEN ←─────────┘
CLOSED: Normal operation, requests flow through
OPEN: All requests fail fast (no network call)
HALF-OPEN: Limited requests to test recovery
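The state transitions can be written as a small lookup, which is handy for checking observed breaker behavior against the expected sequence. This is an illustrative sketch: `cb_next` and the event names are hypothetical, and the HALF-OPEN to OPEN transition on failure is an assumption based on common circuit breaker designs, not something stated above.

```shell
# Usage: cb_next <state> <event>; prints the next state.
# Events (illustrative names): failure_threshold, timeout, success, failure
cb_next() {
  case "$1:$2" in
    CLOSED:failure_threshold) echo OPEN ;;
    OPEN:timeout)             echo HALF-OPEN ;;
    HALF-OPEN:success)        echo CLOSED ;;
    HALF-OPEN:failure)        echo OPEN ;;   # assumption: trial request failed, reopen
    *)                        echo "$1" ;;   # any other event leaves the state unchanged
  esac
}
```

During an experiment, feeding the events seen in logs through `cb_next` and comparing against the state the service reports shows whether the breaker opened, probed, and closed when expected.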
Organize a structured resilience testing exercise:
GAME DAY PLAN:
Date: <scheduled date>
Duration: <2-4 hours>
Facilitator: <person>
Participants: <team members and roles>
OBJECTIVES:
1. Validate <specific resilience property>
2. Test <incident response procedure>
3. Verify <recovery time objective>
TIMELINE:
...
CHAOS ENGINEERING REPORT — <system>
Experiments run: <N>
Hypotheses confirmed: <N>/<total>
Surprises found: <N>
RESILIENCE SCORECARD:
| Failure Domain | Grade | Notes |
|---|---|---|
| Network latency | A/B/C/F | <detail> |
| Network partition | A/B/C/F | <detail> |
| Process crash | A/B/C/F | <detail> |
| Memory pressure | A/B/C/F | <detail> |
docs/chaos/<system>-experiments.md
docs/chaos/<system>-gameday-plan.md
docs/chaos/<system>-resilience-report.md
"chaos: <system> — <N> experiments, resilience: <grade>"
/godmode:fix to address, then re-test."
/godmode:ship."
# Chaos injection tools
tc qdisc add dev eth0 root netem delay 500ms
kubectl delete pod <pod-name> --grace-period=0 --force
stress-ng --cpu $(nproc) --vm 1 --vm-bytes 80% --timeout 60s
redis-cli FLUSHALL
| Flag | Description |
|---|---|
| (none) | Full chaos assessment — map failure domains, design experiments |
| `--experiment <name>` | Run a specific pre-designed experiment |
| `--network` | Network failure experiments only |
timestamp experiment_name hypothesis blast_radius duration result surprises
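The column list above reads like the header of a results log. A minimal sketch of appending one row per experiment follows, assuming a tab-separated file; `log_result` and the file path are hypothetical, and the timestamp format (UTC, ISO 8601) is a choice made for this example.

```shell
# Usage: log_result <file> <experiment_name> <hypothesis> <blast_radius> <duration> <result> <surprises>
log_result() {
  local file="$1"; shift
  # Write the header once, when the file is missing or empty.
  [ -s "$file" ] || printf 'timestamp\texperiment_name\thypothesis\tblast_radius\tduration\tresult\tsurprises\n' > "$file"
  # Append one tab-separated row, prefixed with the current UTC time.
  printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$@" >> "$file"
}
```

Fields must not contain tabs or newlines for the file to stay parseable.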
On activation, automatically detect infrastructure context:
AUTO-DETECT:
1. Container orchestration:
kubectl cluster-info 2>/dev/null && echo "kubernetes"
docker info 2>/dev/null && echo "docker"
2. Cloud provider:
aws sts get-caller-identity 2>/dev/null && echo "aws"
gcloud config get-value project 2>/dev/null && echo "gcp"
3. Service mesh / proxy:
kubectl get crd | grep -i istio && echo "istio"
linkerd check 2>/dev/null && echo "linkerd"
...
KEEP if: improvement verified. DISCARD if: regression or no change. Revert discards immediately.
Stop when: target reached, budget exhausted, or >5 consecutive discards.
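The stop rule can be expressed as a single predicate for a driving loop. This is a sketch only: `should_stop` and its argument order are hypothetical, and "budget" is treated as a simple count of remaining iterations.

```shell
# Usage: should_stop <target_reached: 0|1> <budget_left> <consecutive_discards>
# Succeeds (exit status 0) when the loop should stop.
should_stop() {
  [ "$1" -eq 1 ] && return 0   # target reached
  [ "$2" -le 0 ] && return 0   # budget exhausted
  [ "$3" -gt 5 ] && return 0   # more than 5 consecutive discards
  return 1
}
```

A loop would reset the discard counter to zero on every KEEP and increment it on every DISCARD before calling `should_stop`.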
npx claudepluginhub arbazkhan971/godmode