From grimoire
Injects controlled failures into systems to discover weaknesses before they cause outages. Use when validating reliability before launches or investigating SLO breaches.
Slash command
/grimoire:run-chaos-engineering
Deliberately inject failures into a system in a controlled way to discover weaknesses before they cause unplanned outages.
Adopted by: Netflix (Chaos Monkey, Simian Army), Amazon (GameDay), Google (DiRT, Disaster Recovery Testing), LinkedIn, Microsoft Azure
Impact: Netflix found that Chaos Monkey reduced the impact of actual cloud failures by 85% after adoption; Basiri et al. (2016): systems with regular chaos testing have 50% fewer unplanned outages; weaknesses discovered proactively are 10-100x cheaper to fix than production incidents
Why best: Systems fail in unexpected ways under real conditions; testing assumptions in a controlled environment is the only way to discover failure modes before users do
Sources: Rosenthal, Jones, et al. "Chaos Engineering" O'Reilly (2020); Basiri et al. "Chaos Engineering" IEEE Software (2016); Nygard "Release It!" Pragmatic Programmers (2018)
Define steady-state behavior — Before injecting chaos, define what normal looks like: p99 latency < 200 ms, error rate < 0.1%, throughput > 1000 RPS. These are your steady-state hypotheses. Chaos engineering tests whether the system maintains steady state under failure conditions. Undefined steady state means undefined success criteria.
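As a rough illustration, the steady-state thresholds above could be captured as data and checked against observed metrics. This is a minimal sketch: the `SteadyState` class and its field names are hypothetical, and the default values are just the example numbers from this step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    """Hypothetical container for the example thresholds; tune per service."""
    max_p99_latency_ms: float = 200.0   # p99 latency must stay below this
    max_error_rate: float = 0.001       # 0.1%, expressed as a fraction
    min_throughput_rps: float = 1000.0  # throughput must stay above this

    def violations(self, p99_latency_ms, error_rate, throughput_rps):
        """Return a list of violated thresholds; an empty list means steady."""
        problems = []
        if p99_latency_ms >= self.max_p99_latency_ms:
            problems.append(f"p99 latency {p99_latency_ms} ms")
        if error_rate >= self.max_error_rate:
            problems.append(f"error rate {error_rate:.4f}")
        if throughput_rps <= self.min_throughput_rps:
            problems.append(f"throughput {throughput_rps} RPS")
        return problems


steady = SteadyState()
print(steady.violations(p99_latency_ms=150.0, error_rate=0.0005, throughput_rps=1200.0))
```

An empty result means every threshold held, which gives the experiment an explicit success criterion.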
Formulate a hypothesis — "We believe that if a single availability zone fails, the system will fail over within 60 seconds with no more than 0.5% of requests failing." The hypothesis is specific, measurable, and falsifiable. Vague hypotheses ("the system will handle failure") are untestable.
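One way to keep a hypothesis falsifiable is to write it down as numbers that an experiment result can be compared against. The sketch below encodes the availability-zone example from this step; the `Hypothesis` class and its field names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """Hypothetical record of a measurable, falsifiable hypothesis."""
    failure: str
    max_failover_seconds: float
    max_failed_request_ratio: float

    def holds(self, failover_seconds, failed_requests, total_requests):
        """True if the observed outcome stays within both stated limits."""
        if total_requests <= 0:
            raise ValueError("need at least one request to evaluate the hypothesis")
        failed_ratio = failed_requests / total_requests
        return (failover_seconds <= self.max_failover_seconds
                and failed_ratio <= self.max_failed_request_ratio)


# "Fails over within 60 seconds with no more than 0.5% of requests failing."
az_failure = Hypothesis("single availability zone failure", 60.0, 0.005)
print(az_failure.holds(failover_seconds=42.0, failed_requests=30, total_requests=10_000))
```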
Choose the minimum blast radius — Start with the smallest scope that validates the hypothesis: one container restart, not a full AZ failure. Chaos engineering is not about maximum destruction; it is about minimum sufficient injection to test the hypothesis. Escalate scope as confidence in the system's resilience grows.
Run in non-production first — Execute chaos experiments in staging with production-like traffic (via traffic mirroring or synthetic load). Graduating to production requires: successful non-prod results, automated abort conditions, stakeholder awareness, and a defined rollback procedure.
Define abort conditions — Before starting any experiment, define automatic and manual abort triggers: if error rate exceeds 1%, abort immediately and restore steady state. If on-call is paged during the experiment, abort. If the experiment is still running at the defined end time, abort. Abort conditions are non-negotiable; define them before you inject.
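The three abort triggers named above can be evaluated by a single function that the experiment runner calls on every monitoring tick. This is only a sketch: `abort_reasons` and its parameters are hypothetical, and how the error rate and paging status are obtained is left to your own tooling.

```python
from datetime import datetime, timedelta, timezone


def abort_reasons(error_rate, on_call_paged, now, end_time, max_error_rate=0.01):
    """Return the list of triggered abort conditions; empty means keep going."""
    reasons = []
    if error_rate > max_error_rate:   # error rate exceeds 1%
        reasons.append("error rate above threshold")
    if on_call_paged:                 # a page fired during the experiment
        reasons.append("on-call was paged")
    if now >= end_time:               # still running at the defined end time
        reasons.append("experiment reached its end time")
    return reasons


start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)  # illustrative timestamps
end = start + timedelta(minutes=15)
print(abort_reasons(error_rate=0.02, on_call_paged=False, now=start, end_time=end))
```

Any non-empty result should stop the injection and restore steady state.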
Inject failure — Use chaos engineering tools: AWS Fault Injection Service (FIS) for cloud failure injection, Chaos Monkey for instance termination, Pumba for Docker container chaos, Litmus or Chaos Mesh for Kubernetes, Gremlin for agent-based injection. Common injection types: instance termination, CPU pressure, memory pressure, network latency, network packet loss, disk I/O stress, dependency failure (kill a downstream service).
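To make two of the injection types concrete (network latency and dependency failure), here is a toy in-process injector that wraps a call to a downstream dependency. It is not the API of FIS, Chaos Monkey, Pumba, Litmus, Chaos Mesh, or Gremlin; `with_fault_injection` and its parameters are hypothetical and only illustrate the idea at application level.

```python
import random
import time


def with_fault_injection(call, failure_rate=0.0, added_latency_s=0.0,
                         rng=None, sleep=time.sleep):
    """Wrap `call` so it is delayed and, with some probability, fails."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        if added_latency_s > 0:
            sleep(added_latency_s)            # simulated network latency
        if rng.random() < failure_rate:       # simulated dependency failure
            raise ConnectionError("injected dependency failure")
        return call(*args, **kwargs)

    return wrapped


def fetch_price(item_id):
    """Stand-in for a real downstream call."""
    return {"item_id": item_id, "price": 9.99}


flaky_fetch = with_fault_injection(fetch_price, failure_rate=0.2, added_latency_s=0.05)
```

Keeping `failure_rate` and `added_latency_s` small at first matches the minimum-blast-radius principle; both can be raised as confidence grows.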
Monitor continuously during injection — Watch your observability stack in real time during the experiment. Track: error rate, latency, service health, and auto-recovery metrics. Compare against steady-state baseline. The goal is to observe what happens, not just whether the system recovers.
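Observing what happens, rather than only whether the system recovers, usually means keeping the whole time series. The sketch below assumes error-rate samples have already been collected as `(seconds_since_injection, error_rate)` pairs; the function names and sample values are made up for illustration.

```python
def time_to_recover(samples, max_error_rate=0.001):
    """Return the time from which the error rate stays below the threshold.

    `samples` must be sorted by time. Returns None if the final samples are
    still unhealthy, and the first timestamp if the threshold was never breached.
    """
    recovered_at = None
    for seconds, error_rate in samples:
        if error_rate >= max_error_rate:
            recovered_at = None          # unhealthy again, reset
        elif recovered_at is None:
            recovered_at = seconds       # start of a healthy streak
    return recovered_at


def peak_error_rate(samples):
    """Worst error rate seen during the experiment."""
    return max(rate for _, rate in samples)


observed = [(0, 0.0), (10, 0.05), (20, 0.03), (30, 0.0005), (40, 0.0002)]
print(time_to_recover(observed), peak_error_rate(observed))
```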
Analyze results and compare to hypothesis — Did the system maintain steady state? If yes: the hypothesis is validated; document the result and increase scope or move to the next failure mode. If no: the hypothesis is falsified; you've found a real reliability weakness. Document the failure mode and create a remediation backlog item.
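The two outcomes described above map directly onto a small decision function. This is a sketch with a hypothetical `conclude` helper; the returned dictionary shape is an assumption, not a prescribed report format.

```python
def conclude(hypothesis, steady_state_held):
    """Turn an experiment outcome into a verdict and a follow-up action."""
    if steady_state_held:
        return {
            "hypothesis": hypothesis,
            "verdict": "validated",
            "next_step": "increase scope or move to the next failure mode",
        }
    return {
        "hypothesis": hypothesis,
        "verdict": "falsified",
        "next_step": "document the failure mode and create a remediation backlog item",
    }


result = conclude("AZ failure: failover within 60 s, at most 0.5% failed requests", False)
print(result["verdict"], "->", result["next_step"])
```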
Automate recurring experiments — After a successful experiment, automate it to run regularly in staging (weekly or on every deployment). Automated chaos prevents regressions: a code change that inadvertently breaks failure handling is caught before production. Use CI/CD integration to block deployments that fail chaos gates.
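A chaos gate in CI/CD can be as simple as a script whose exit code reflects whether every automated experiment's hypothesis held. The sketch below assumes the results are already available as a mapping from experiment name to a boolean; `chaos_gate` and the experiment names are hypothetical.

```python
def chaos_gate(results):
    """Return a process exit code: 0 if all hypotheses held, 1 otherwise."""
    failed = sorted(name for name, held in results.items() if not held)
    for name in failed:
        print(f"chaos gate failed: {name}")
    return 1 if failed else 0


# Illustrative results from a staging run.
staging_results = {"container-restart": True, "dependency-timeout": False}
exit_code = chaos_gate(staging_results)
```

In a pipeline step, passing `exit_code` to `sys.exit` makes a failed experiment fail the job, which is what blocks the deployment.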
Document and share results — Publish experiment results (hypothesis, injection, observations, conclusions, remediations) to the engineering team. A culture of visible failure discovery reduces fear of chaos engineering and accelerates organizational learning. Every fixed weakness is a prevented outage.
npx claudepluginhub jeffreytse/grimoire --plugin grimoire