Builds a workflow that runs the project's existing API tests under injected network chaos (latency, timeouts, dropped connections, bandwidth caps, packet loss) using Toxiproxy as the proxy layer, with notes on the alternatives Pumba, Gremlin, and LitmusChaos. Defines a chaos matrix per test scenario, runs each row, and reports which assertions break under which conditions. Use when the team needs to verify that the API's resilience patterns (retry, circuit-breaker, timeout, fallback) actually work.
Slash command: /qa-api-testing:api-chaos-runner
Most API tests run against perfect networks: <1ms latency, no packet loss, infinite bandwidth, deterministic ordering. Real production isn't like that. Network chaos testing drives the existing tests under controlled network impairment - the team discovers which retry / circuit-breaker / timeout patterns actually hold up before real customers find out.
The canonical open-source primitive is Toxiproxy - Shopify's "TCP proxy to simulate network and system conditions for chaos and resiliency testing" (toxiproxy-readme). The pattern: sit Toxiproxy between client and upstream; manipulate toxics (latency, timeout, bandwidth, etc.) during test execution.
This skill is build-an-X - the workflow chains the team's existing API tests (Postman / Karate / RestAssured / Tavern / Schemathesis) through a Toxiproxy-managed connection and orchestrates a per-scenario chaos matrix.
If the team is just starting API testing and has no resilience
patterns to verify, this skill is overkill - start with happy-path
coverage via postman-collections
or the language-native equivalents first.
| Tool | Layer | Best for |
|---|---|---|
| Toxiproxy | TCP proxy | Per-connection latency / bandwidth / timeout / drop. Most precise. |
| Pumba | Docker container | Container-level chaos (kill, pause, network). |
| Gremlin (commercial) | Multi-platform | Production-grade chaos with audit / approval flow. |
| LitmusChaos | Kubernetes operator | Cloud-native; experiments declared as CRDs. |
| tc qdisc (Linux native) | Network interface | Lowest level; most setup; CI-friendly only with --cap-add NET_ADMIN. |
Default recommendation: Toxiproxy for per-API chaos in CI. The others fit when the team is already in those ecosystems (Docker-Compose-heavy projects, Kubernetes-first projects).
For each existing API test scenario, define a matrix of conditions to run it under:
| Scenario | Toxic | Expected behavior |
|---|---|---|
| Order create (POST /orders) | None (control) | 201 in <500ms |
| Order create | latency=1000ms | 201 in <2s (within timeout budget) |
| Order create | latency=10000ms | 504 with retry-after, OR client gives up |
| Order create | bandwidth rate=10 (KB/s) | 201 (eventual) OR 408 timeout |
| Order create | reset_peer | 502 with retry attempted |
| Order create | timeout | 504; circuit-breaker opens after 3rd |
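Per toxiproxy-readme, the toxic types are latency, bandwidth, slow_close, timeout, reset_peer, slicer, and limit_data. The README also lists down (forced failure), which is not implemented as a toxic: the proxy itself is disabled (enabled set to false), so new connections are refused.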
The matrix is the load-bearing artifact: what the team expects under each condition is what differentiates resilience verification from "did the test pass?" The Expected column drives the assertions.
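One way to keep the Expected column executable is to store the matrix as data and check each observed response against it. The sketch below is illustrative only: `ChaosRow`, `MATRIX`, and `check` are hypothetical names, and the 30-second budget on the bandwidth row is an assumed stand-in for "eventual".

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ChaosRow:
    scenario: str
    toxic: Optional[str]            # None marks the control row
    allowed_statuses: frozenset     # statuses the Expected column accepts
    max_seconds: float              # time budget from the Expected column
    attributes: dict = field(default_factory=dict)


# Three rows of the Order create matrix above, expressed as data.
MATRIX = [
    ChaosRow("POST /orders", None, frozenset({201}), 0.5),
    ChaosRow("POST /orders", "latency", frozenset({201}), 2.0, {"latency": 1000}),
    # Assumption: "eventual" is capped at 30 s so the row can fail.
    ChaosRow("POST /orders", "bandwidth", frozenset({201, 408}), 30.0, {"rate": 10}),
]


def check(row, status, elapsed_seconds):
    """Return a list of violations; an empty list means the row held."""
    problems = []
    if status not in row.allowed_statuses:
        problems.append(f"status {status} not in {sorted(row.allowed_statuses)}")
    if elapsed_seconds > row.max_seconds:
        problems.append(f"took {elapsed_seconds}s, budget {row.max_seconds}s")
    return problems
```

Keeping expectations next to the toxic definition means a changed timeout budget is a one-line edit to the matrix rather than a hunt through test assertions.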
```yaml
# docker-compose.test.yml
services:
  toxiproxy:
    image: ghcr.io/shopify/toxiproxy:latest
    ports:
      - "8474:8474" # control API
      - "5432:5432" # proxied DB
      - "8080:8080" # proxied API
  app:
    build: .
    environment:
      DATABASE_URL: 'postgres://user:pass@toxiproxy:5432/db'
      EXTERNAL_API_URL: 'http://toxiproxy:8080'
```
Per toxiproxy-readme, the application points at Toxiproxy's listen ports rather than at the upstream. With no toxic active, Toxiproxy forwards traffic to the real upstream unmodified.
```bash
# Register the upstream
curl -d '{"name":"orders-api","listen":"0.0.0.0:8080","upstream":"orders-api-real:8080"}' \
  http://toxiproxy:8474/proxies
```
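The compose file proxies both the API and the database, so both proxies need registering before the tests start. A minimal sketch, assuming the control API's `POST /populate` endpoint (described in the Toxiproxy README as create-or-replace for a list of proxies); the database proxy name `orders-db` and its upstream `postgres-real:5432` are hypothetical placeholders, as are the function names.

```python
import json


def proxy_definitions():
    """Proxy list matching the compose file; the DB entries are assumed names."""
    return [
        {"name": "orders-api", "listen": "0.0.0.0:8080",
         "upstream": "orders-api-real:8080", "enabled": True},
        # Hypothetical: the source does not name the DB proxy or its upstream.
        {"name": "orders-db", "listen": "0.0.0.0:5432",
         "upstream": "postgres-real:5432", "enabled": True},
    ]


def populate_body(definitions):
    """Validate the list and return the JSON bytes to POST to /populate."""
    names = [d["name"] for d in definitions]
    listens = [d["listen"] for d in definitions]
    if len(set(names)) != len(names):
        raise ValueError("duplicate proxy name")
    if len(set(listens)) != len(listens):
        raise ValueError("two proxies share a listen address")
    return json.dumps(definitions).encode("utf-8")
```

The returned bytes would be sent as the request body to http://toxiproxy:8474/populate; validating first catches a copy-pasted listen port before Toxiproxy rejects it.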
Per toxiproxy-readme:
```bash
# 1000ms latency on every request through this proxy
toxiproxy-cli toxic add -t latency -a latency=1000 orders-api
# Bandwidth cap at 10 KB/s
toxiproxy-cli toxic add -t bandwidth -a rate=10 orders-api
# Forced timeout: stop all data, close the connection after 5000 ms
toxiproxy-cli toxic add -t timeout -a timeout=5000 orders-api
# Remove one toxic by name (the default name is <type>_<stream>)
toxiproxy-cli toxic remove -n latency_downstream orders-api
```
For a stateless add-test-remove cycle, the language-native client
libraries (toxiproxy-python, toxiproxy-node, toxiproxy-ruby,
toxiproxy-go) wrap the HTTP API.
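The add-test-remove cycle can also be sketched directly against the HTTP API without a client library. This is a hedged illustration, not any library's API: `http_send` and `toxic` are hypothetical helpers, the toxic name `chaos` is an arbitrary choice, and the endpoints used are `POST /proxies/{proxy}/toxics` and `DELETE /proxies/{proxy}/toxics/{name}` as listed in the Toxiproxy README.

```python
import json
import urllib.request
from contextlib import contextmanager


def http_send(method, url, body=None):
    """Send one JSON request to the control API and return the HTTP status."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


@contextmanager
def toxic(proxy, toxic_type, attributes, api="http://localhost:8474",
          name="chaos", send=http_send):
    """Add one toxic for the duration of the with-block, then always remove it."""
    toxics_url = f"{api}/proxies/{proxy}/toxics"
    send("POST", toxics_url, {
        "name": name,
        "type": toxic_type,
        "stream": "downstream",
        "attributes": attributes,
    })
    try:
        yield
    finally:
        # Runs even if the tests raise, so no toxic leaks into the next row.
        send("DELETE", f"{toxics_url}/{name}")


# Usage (needs a running Toxiproxy):
#   with toxic("orders-api", "latency", {"latency": 1000}):
#       run_the_api_tests()   # hypothetical test entry point
```

The `send` parameter exists so the cycle can be unit-tested with a fake transport; in a real run the default is used.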
A minimal runner shell script:
```bash
#!/usr/bin/env bash
# scripts/chaos-matrix.sh
# Runs the Newman collection once per chaos row; writes one JUnit file per row.
set -euo pipefail

PROXY=orders-api
TOXIC_NAME=chaos  # explicit name so each row's toxic can be removed by name

run_with_toxic() {
  local label="$1" type="$2" args="$3"
  echo "=== $label ==="

  # Clear the toxic left by the previous row (ignore "not found").
  toxiproxy-cli toxic remove -n "$TOXIC_NAME" "$PROXY" 2>/dev/null || true

  if [ -n "$type" ]; then
    # $args is deliberately unquoted so '-a key=value' splits into two words.
    toxiproxy-cli toxic add -t "$type" -n "$TOXIC_NAME" $args "$PROXY"
  fi

  # Don't bail on a failing row; we want the whole matrix.
  npx newman run collections/orders.postman_collection.json \
    -e environments/chaos.json \
    -r cli,junit \
    --reporter-junit-export "results-$label.xml" || true
}

run_with_toxic 'control'    ''         ''
run_with_toxic 'latency-1s' latency    '-a latency=1000'
run_with_toxic 'bandwidth'  bandwidth  '-a rate=10'
run_with_toxic 'timeout'    timeout    '-a timeout=5000'
run_with_toxic 'reset-peer' reset_peer '-a timeout=0'

# Leave the proxy clean for whatever runs next.
toxiproxy-cli toxic remove -n "$TOXIC_NAME" "$PROXY" 2>/dev/null || true
```
The matrix produces one JUnit XML per scenario. Aggregate them in the report stage.
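The aggregation step can be sketched with the standard library alone. This assumes the files follow the runner's results-<label>.xml naming and use the common JUnit layout of `testcase` elements with optional `failure` or `error` children; `summarize`, `label_from_path`, and `collect` are hypothetical helper names.

```python
import os
import xml.etree.ElementTree as ET
from glob import glob


def summarize(root):
    """Count test cases and failed test cases under a JUnit XML root element."""
    total = failed = 0
    for case in root.iter("testcase"):
        total += 1
        if case.find("failure") is not None or case.find("error") is not None:
            failed += 1
    return total, failed


def label_from_path(path):
    """'out/results-latency-1s.xml' -> 'latency-1s' (assumes the naming above)."""
    name = os.path.basename(path)
    return name[len("results-"):-len(".xml")]


def collect(pattern="results-*.xml"):
    """Map each chaos label to its (total, failed) counts."""
    return {
        label_from_path(path): summarize(ET.parse(path).getroot())
        for path in sorted(glob(pattern))
    }
```

Counting `testcase` elements rather than trusting the suite-level `failures` attribute keeps the sketch independent of how a given reporter fills in those totals.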
A successful chaos run produces a resilience matrix report:
```markdown
## API Chaos Matrix - verdict: REVIEW

| Scenario           | Control | Latency 1s | Bandwidth 10k | Timeout 5s | Reset peer |
|--------------------|:-------:|:----------:|:-------------:|:----------:|:----------:|
| POST /orders       | ✅      | ✅         | ✅            | ✅         | ❌         |
| GET /orders/:id    | ✅      | ✅         | ❌            | ✅         | ✅         |
| DELETE /orders/:id | ✅      | ✅         | ✅            | ✅         | ✅         |

### Failures

| Test            | Toxic         | Expected                                       | Actual                                |
|-----------------|---------------|------------------------------------------------|---------------------------------------|
| POST /orders    | reset_peer    | 502 + retry attempted; second attempt succeeds | 502; no retry observed in client logs |
| GET /orders/:id | bandwidth=10k | 200 in <30s                                    | 408 timeout at 10s                    |
```
A green matrix isn't the goal - finding where resilience is missing is the goal. A failure under a chaos scenario is a feature request, not a bug in the test.
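Rendering the report from per-row results is mechanical. The sketch below assumes results arrive as a nested dict of scenario to condition to pass/fail; `render_matrix` and `verdict` are hypothetical names, and "PASS" as the all-green verdict is an assumption (the source only shows REVIEW).

```python
def mark(value):
    """True -> check, False -> cross, anything else (not run) -> n/a."""
    return {True: "✅", False: "❌"}.get(value, "n/a")


def render_matrix(results, conditions):
    """results maps scenario -> {condition: passed}; returns a Markdown table."""
    header = "| Scenario | " + " | ".join(conditions) + " |"
    divider = "|---|" + "|".join(":-:" for _ in conditions) + "|"
    rows = [
        f"| {scenario} | " + " | ".join(mark(cells.get(c)) for c in conditions) + " |"
        for scenario, cells in results.items()
    ]
    return "\n".join([header, divider, *rows])


def verdict(results):
    """REVIEW if any cell failed; PASS (assumed label) when everything held."""
    all_green = all(all(cells.values()) for cells in results.values())
    return "PASS" if all_green else "REVIEW"
```

A cell that was never run renders as n/a instead of a cross, so a skipped toxic is not mistaken for a resilience gap.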
Match toxics to documented resilience requirements:
| Resilience pattern documented | Toxic to inject |
|---|---|
| Retry on 5xx | down (upstream refuses connections, which the service under test should surface as a 5xx) or reset_peer |
| Timeout after Nms | latency=N+500 (force the timeout) |
| Circuit-breaker after 3 failures | down for ≥3 requests |
| Fallback to cache when upstream unreachable | down indefinitely |
| Bulkhead under load | bandwidth=very-low |
| Slow-loris client | slow_close on response |
Run only the toxics that map to a documented expectation; running every toxic against every endpoint is noise.
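The mapping table can be encoded so that only documented requirements generate chaos rows. A minimal sketch under stated assumptions: the `kind` strings and `plan_toxics` are hypothetical, "very-low" bandwidth is taken to be 1 KB/s unless specified, and "down" is returned as a marker meaning "disable the proxy" since it is not added like other toxics.

```python
def plan_toxics(documented):
    """Turn documented resilience requirements into (toxic, attributes) pairs."""
    planned = []
    for requirement in documented:
        kind = requirement["kind"]
        if kind == "timeout":
            # latency = N + 500 ms forces the client's N ms timeout to fire.
            planned.append(("latency", {"latency": requirement["timeout_ms"] + 500}))
        elif kind == "retry":
            planned.append(("reset_peer", {"timeout": 0}))
        elif kind in ("circuit_breaker", "cache_fallback"):
            # Marker only: the runner disables the proxy instead of adding a toxic.
            planned.append(("down", {}))
        elif kind == "bulkhead":
            # Assumption: 1 KB/s stands in for "very-low" bandwidth.
            planned.append(("bandwidth", {"rate": requirement.get("rate_kb_per_s", 1)}))
        else:
            # No documented expectation means no chaos row.
            raise ValueError(f"no toxic mapped for {kind!r}")
    return planned
```

For a documented 2000 ms timeout, this yields a latency toxic of 2500 ms, which is the N+500 rule from the table applied literally.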
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| Chaos in production | Real users observe; oncall pages. | CI / staging only. Production chaos requires the team's full chaos engineering practice (Gremlin / Litmus + approval flow). |
| Per-PR chaos matrix | Adds 10+ minutes; team disables. | Nightly chaos runs; PR runs only the control row. |
| Asserting "chaos must not break anything" | Every system has a breaking point; the test trivially fails. | Assert specific resilience behavior under specific conditions; document the breaking point as accepted. |
| Using down for everything | down only models a hard outage; it doesn't model real-world latency or bandwidth limits. | Mix latency, bandwidth, timeout, and reset_peer for realistic conditions. |
| Skipping the control row | Without control, the matrix can't distinguish chaos failures from test bugs. | Always run a no-toxic scenario as the baseline. |
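Two of the fixes above, "PR runs only the control row" and "always run a baseline", reduce to small decisions the runner can make in code. The sketch below is illustrative: `rows_for`, `classify`, the trigger string "nightly", and the outcome labels are all hypothetical names.

```python
def rows_for(trigger, labels):
    """Nightly builds run the full matrix; any other trigger runs only control."""
    if trigger == "nightly":
        return list(labels)
    return [label for label in labels if label == "control"]


def classify(control_passed, chaos_passed):
    """Use the control row to separate resilience gaps from broken tests."""
    if not control_passed:
        # The test fails even without a toxic: fix the test or environment first.
        return "baseline-broken"
    return "held" if chaos_passed else "resilience-gap"
```

A chaos failure is only reported as a resilience gap when the same scenario passed with no toxic, which is what makes the control row load-bearing.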
postman-collections, tavern-testing, karate-testing, restassured-testing - example-based test suites that this skill drives through chaos.
npx claudepluginhub testland/qa --plugin qa-api-testing