From agentic-qe-fleet
Guides chaos engineering experiments: define steady states, inject failures (network latency, instance kills, app exceptions), observe metrics (error rate, latency), validate recovery for distributed systems resilience.
Slash command: /agentic-qe-fleet:chaos-engineering-resilience
<default_to_action> When testing system resilience or injecting failures:
Quick Chaos Steps:
Critical Success Factors:
| Category | Failures | Tools |
|---|---|---|
| Network | Latency, packet loss, partition | tc, toxiproxy |
| Infrastructure | Instance kill, disk failure, CPU | Chaos Monkey |
| Application | Exceptions, slow responses, leaks | Gremlin, LitmusChaos |
| Dependencies | Service outage, timeout | WireMock |
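As a rough illustration of the "Application: Exceptions" row above, the following sketch wraps a function so that a configurable fraction of calls throw. The helper name `withInjectedExceptions`, the `FaultConfig` shape, and the injectable `rng` are hypothetical and not part of agentic-qe or any listed tool; tools such as Gremlin or LitmusChaos inject faults at a different layer.

```typescript
// Hypothetical fault-injection wrapper (illustrative only, not an agentic-qe API).
type Rng = () => number; // expected to return a value in [0, 1)

interface FaultConfig {
  failureRate: number; // fraction of calls that should fail, between 0 and 1
  rng?: Rng;           // injectable random source so tests can be deterministic
}

function withInjectedExceptions<A extends unknown[], R>(
  fn: (...args: A) => R,
  config: FaultConfig
): (...args: A) => R {
  const rng: Rng = config.rng ?? Math.random;
  return (...args: A): R => {
    // Throw before calling the real function when the draw falls under the rate.
    if (rng() < config.failureRate) {
      throw new Error("chaos: injected application exception");
    }
    return fn(...args);
  };
}

// Example: roughly 10% of calls to the wrapped function would throw.
const flakyDouble = withInjectedExceptions((x: number) => x * 2, { failureRate: 0.1 });
```

Passing a fixed `rng` keeps the experiment reproducible, which helps when comparing runs against the same steady-state baseline.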
Dev (safe) → Staging → 1% prod → 10% → 50% → 100%
    ↓           ↓         ↓                   ↓
  Learn     Validate   Careful         Full confidence
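The progression above can be sketched as a loop that only widens the blast radius while the steady state keeps holding. This is a minimal sketch under stated assumptions: `runStage` is a hypothetical callback that runs one experiment at the given stage and reports whether the steady state held, and the stage names simply mirror the diagram.

```typescript
// Stages mirror the progression in the diagram above.
const STAGES = ["dev", "staging", "1% prod", "10% prod", "50% prod", "100% prod"] as const;
type Stage = (typeof STAGES)[number];

interface ProgressionResult {
  completed: Stage[];      // stages where the steady state held
  haltedAt: Stage | null;  // first stage that violated steady state, if any
}

// runStage is a hypothetical callback: it runs the experiment at one stage
// and returns true when the steady state held throughout.
function runProgressively(runStage: (stage: Stage) => boolean): ProgressionResult {
  const completed: Stage[] = [];
  for (const stage of STAGES) {
    if (!runStage(stage)) {
      // Stop immediately: never widen the blast radius after a failed stage.
      return { completed, haltedAt: stage };
    }
    completed.push(stage);
  }
  return { completed, haltedAt: null };
}
```

Halting at the first failing stage keeps any discovered weakness confined to the smallest blast radius at which it appears.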
| Metric | Normal | Alert Threshold |
|---|---|---|
| Error rate | < 0.1% | > 1% |
| p99 latency | < 200ms | > 500ms |
| Throughput | baseline | > 20% below baseline |
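To make the alert thresholds in the table concrete, here is a small sketch that compares a metrics snapshot against them. The `Metrics` shape, the field names, and the function `findAlerts` are assumptions made for illustration; in practice these values would come from whatever monitoring system is in use.

```typescript
// Hypothetical snapshot shape; field names are illustrative.
interface Metrics {
  errorRate: number;     // fraction of failed requests, e.g. 0.001 for 0.1%
  p99LatencyMs: number;  // 99th percentile latency in milliseconds
  throughputRps: number; // requests per second
}

// Returns a description for every alert threshold from the table that is breached.
function findAlerts(current: Metrics, baselineThroughputRps: number): string[] {
  const alerts: string[] = [];
  if (current.errorRate > 0.01) {
    alerts.push("error rate above 1%");
  }
  if (current.p99LatencyMs > 500) {
    alerts.push("p99 latency above 500ms");
  }
  if (current.throughputRps < baselineThroughputRps * 0.8) {
    alerts.push("throughput more than 20% below baseline");
  }
  return alerts;
}
```

An empty result means none of the alert thresholds is breached; it does not by itself confirm the stricter "Normal" column.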
// Chaos experiment definition
const experiment = {
  name: 'Database latency injection',
  hypothesis: 'System handles 500ms DB latency gracefully',
  steadyState: {
    errorRate: '< 0.1%',
    p99Latency: '< 300ms'
  },
  method: {
    type: 'network-latency',
    target: 'database',
    delay: '500ms',
    duration: '5m'
  },
  rollback: {
    automatic: true,
    trigger: 'errorRate > 5%'
  }
};
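The `rollback.trigger` string above is only a declaration; something has to evaluate it against live metrics. The sketch below shows one possible way to do that, assuming the trigger always has the form `<metric> <op> <number>%`. The functions `parseTrigger` and `shouldRollback` are hypothetical and are not how agentic-qe necessarily interprets the field.

```typescript
// Hypothetical evaluator for trigger strings such as 'errorRate > 5%'.
interface Trigger {
  metric: string;
  op: ">" | "<";
  threshold: number; // stored as a fraction, so '5%' becomes 0.05
}

function parseTrigger(expr: string): Trigger {
  const match = /^\s*(\w+)\s*([<>])\s*(\d+(?:\.\d+)?)\s*%\s*$/.exec(expr);
  if (!match) {
    throw new Error(`Unsupported trigger expression: ${expr}`);
  }
  return {
    metric: String(match[1]),
    op: match[2] === ">" ? ">" : "<",
    threshold: Number(match[3]) / 100,
  };
}

// metrics maps metric names to fractions, e.g. { errorRate: 0.06 } for 6%.
function shouldRollback(expr: string, metrics: Record<string, number>): boolean {
  const { metric, op, threshold } = parseTrigger(expr);
  const value = metrics[metric];
  // Assumption: a missing metric is treated as unsafe, so we roll back.
  if (value === undefined) {
    return true;
  }
  return op === ">" ? value > threshold : value < threshold;
}
```

Treating a missing metric as a reason to roll back is a deliberate fail-safe choice in this sketch: losing visibility during an experiment is itself a signal to stop.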
// qe-chaos-engineer runs controlled experiments
await Task("Chaos Experiment", {
  target: 'payment-service',
  failure: 'terminate-random-instance',
  blastRadius: '10%',
  duration: '5m',
  steadyStateHypothesis: {
    metric: 'success-rate',
    threshold: 0.99
  },
  autoRollback: true
}, "qe-chaos-engineer");

// Validates:
// - System recovers automatically
// - Error rate stays within threshold
// - No data loss
// - Alerts triggered appropriately
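One way to check the "system recovers automatically" point is to measure how long the success rate takes to return above the hypothesis threshold and stay there. The following is a minimal sketch, assuming success-rate samples are collected from the moment of injection; the `Sample` shape and `recoveryTimeSeconds` are illustrative names, not part of the qe-chaos-engineer agent.

```typescript
// Hypothetical sample shape: one success-rate reading per timestamp.
interface Sample {
  tSeconds: number;    // seconds since the observation window started
  successRate: number; // fraction of successful requests, 0..1
}

// Returns the seconds from the first sample until the success rate is back at or
// above the threshold for the rest of the window, or null if it never recovers.
function recoveryTimeSeconds(samples: Sample[], threshold: number): number | null {
  if (samples.length === 0) {
    return null; // nothing observed, so recovery cannot be confirmed
  }
  // Find the last sample that violated the threshold.
  let lastBad = -1;
  for (let i = 0; i < samples.length; i++) {
    if (samples[i].successRate < threshold) {
      lastBad = i;
    }
  }
  if (lastBad === -1) {
    return 0; // the success rate never dipped below the threshold
  }
  if (lastBad === samples.length - 1) {
    return null; // still degraded at the end of the window
  }
  return samples[lastBad + 1].tSeconds - samples[0].tSeconds;
}
```

Using the last violating sample, rather than the first healthy one, avoids declaring recovery during a brief bounce that is followed by another dip.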
aqe/chaos-engineering/
├── experiments/* - Experiment definitions & results
├── steady-states/* - Baseline measurements
├── runbooks/* - Generated recovery procedures
└── blast-radius/* - Impact analysis
const chaosFleet = await FleetManager.coordinate({
  strategy: 'chaos-engineering',
  agents: [
    'qe-chaos-engineer',         // Experiment execution
    'qe-performance-tester',     // Baseline metrics
    'qe-production-intelligence' // Production monitoring
  ],
  topology: 'sequential'
});
Break things on purpose to prevent unplanned outages. Find weaknesses before users do. Define steady state, inject failures, measure impact, fix weaknesses, create runbooks. Start small, increase blast radius gradually.
With Agents: qe-chaos-engineer automates chaos experiments with blast radius control, automatic rollback, and comprehensive resilience validation. Generates runbooks from experiment results.
npx claudepluginhub proffesor-for-testing/agentic-qe --plugin agentic-qe-fleet