From harness-claude
Detects missing resilience patterns (circuit breakers, retries, timeouts, bulkheads) in service dependencies and recommends production-grade fault tolerance configurations.
How this skill is triggered — by the user, by Claude, or both
Slash command
/harness-claude:harness-resilienceThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
> Circuit breakers, rate limiting, bulkheads, retry patterns, and fault tolerance analysis. Detects missing resilience patterns, evaluates failure modes, and recommends concrete configurations for production-grade fault tolerance.
Circuit breakers, rate limiting, bulkheads, retry patterns, and fault tolerance analysis. Detects missing resilience patterns, evaluates failure modes, and recommends concrete configurations for production-grade fault tolerance.
Inventory external dependencies. Scan the codebase for outbound connections:
axios, fetch, got, HttpClient, RestTemplate, reqwestMap existing resilience patterns. For each dependency found, check for:
opossum, cockatiel, Polly, resilience4j, hystrix usageDetect anti-patterns. Flag common resilience mistakes:
Build the dependency map. Produce a structured inventory:
Classify failure modes per dependency. For each external dependency:
Assess blast radius. For each failure mode:
Evaluate current coverage. Score each dependency on resilience coverage:
Prioritize gaps by risk. Combine criticality and coverage:
Check observability. For existing patterns, verify they emit metrics:
Select patterns per dependency. Based on the failure mode analysis:
Provide concrete configurations. For each recommended pattern, specify:
Design fallback strategies. For each critical dependency:
Generate implementation templates. Produce code snippets for:
Define health check contracts. Specify how each dependency should be health-checked:
Check pattern correctness. For each implemented pattern:
Verify test coverage. Check that resilience patterns are tested:
Verify observability. Confirm that metrics are emitted:
Produce the resilience report. Output a summary:
Run integration verification. If integration tests exist:
harness skill run harness-resilience -- Primary CLI entry point. Runs all four phases.harness validate -- Run after implementing recommended patterns to verify project integrity.harness check-deps -- Verify that new resilience libraries are properly declared and within boundary rules.emit_interaction -- Used at pattern selection (checkpoint:decision) when multiple valid patterns exist and trade-offs require human judgment.Glob -- Discover HTTP clients, middleware chains, and existing resilience pattern files.Grep -- Search for timeout configurations, retry logic, circuit breaker initialization, and anti-patterns.Write -- Generate implementation templates and resilience configuration files.Edit -- Add resilience wrappers to existing service clients.Phase 1: DETECT
Dependencies found:
- Stripe API (HTTP, critical): axios client in src/payments/stripe-client.ts
Resilience: timeout=5000ms, no retry, no circuit breaker, no fallback
- PostgreSQL (database, critical): pg pool in src/db/pool.ts
Resilience: pool max=20, no query timeout, no read replica fallback
- SendGrid (HTTP, optional): @sendgrid/mail in src/notifications/email.ts
Resilience: none
Anti-patterns:
- src/payments/stripe-client.ts:45 — retry on POST /charges without idempotency key
- src/db/pool.ts — no statement_timeout configured
Phase 2: ANALYZE
Stripe failure modes:
- Timeout: Payment page hangs, user retries, duplicate charges possible
- Outage: All payments fail, revenue impact immediate
- Blast radius: checkout flow, subscription renewal, refund processing
Risk: P0 (critical + partial coverage + anti-pattern)
Phase 3: DESIGN
Stripe recommendations:
- Add opossum circuit breaker: failureThreshold=50%, resetTimeout=30s
- Add idempotency key to all Stripe charge requests
- Set timeout to 8000ms (Stripe p99 is ~3s, 2.5x headroom)
- Fallback: queue payment for async retry via Bull queue
PostgreSQL recommendations:
- Set statement_timeout=5000 in pool config
- Add pg-pool error handler with connection retry
- Configure read replica for GET endpoints via pgBouncer
Phase 4: VALIDATE
Resilience coverage: 33% -> 100% (3/3 dependencies covered)
Anti-patterns resolved: 2/2
Tests needed: circuit breaker state transitions, idempotency key generation
Phase 1: DETECT
Dependencies found:
- user-service (gRPC, critical): @grpc/grpc-js in src/clients/user.client.ts
Resilience: deadline=5s, no retry, no circuit breaker
- inventory-service (gRPC, critical): no resilience configured
- Redis (cache, degraded): ioredis in src/cache/redis.ts
Resilience: reconnectOnError, no bulkhead, no fallback
Phase 2: ANALYZE
inventory-service outage:
- Product pages return 503, search results empty
- Blast radius: catalog, search, cart validation
- Risk: P0 (critical + no coverage)
Phase 3: DESIGN
inventory-service recommendations:
- Add cockatiel circuit breaker with ConsecutiveBreaker(5)
- Add retry with exponentialBackoff(1000, 2) maxAttempts=3
- Add deadline propagation from gateway timeout
- Fallback: serve cached inventory from Redis with staleness header
Redis recommendations:
- Add bulkhead: maxPoolSize=50, separate pools for cache vs sessions
- Add fallback: in-memory LRU cache (lru-cache, max 1000 items)
- Monitor: emit redis.command.duration histogram
Phase 4: VALIDATE
Coverage: 33% -> 100%
Tests verified: gRPC circuit breaker opens after 5 failures,
Redis fallback serves from LRU when Redis is down
| Rationalization | Reality |
|---|---|
| "That third-party API has 99.99% uptime — we don't need a circuit breaker" | 99.99% uptime means 52 minutes of downtime per year. That downtime will not occur as one predictable window — it will happen as degraded responses and timeouts during a traffic spike. Without a circuit breaker, every caller blocks for the full timeout duration, exhausting thread pools and cascading across the system. |
| "We have retry logic, so failures are handled" | Retry logic without a circuit breaker amplifies failures. When the downstream service is degraded, retries multiply the load on an already struggling system. Circuit breakers and retries are complementary controls, not alternatives. |
| "The fallback adds complexity — we'll add it if the circuit breaker actually opens" | A circuit breaker without a fallback is a different kind of failure mode, not resilience. When the circuit opens, users see an error instead of a degraded-but-functional experience. Fallbacks must be designed and tested before the circuit ever opens in production. |
| "Our database connection pool is 100 connections — that's plenty" | Connection pool size without query timeouts means slow queries hold connections indefinitely. A single slow query spike can exhaust the pool, causing every subsequent request to wait. Pool sizing and query timeouts are both required. |
| "The service is internal — it doesn't need rate limiting" | Internal services are often called by automated processes, CI pipelines, and batch jobs that can spike traffic in ways user-facing services do not. Missing rate limiting on internal services is a common cause of self-inflicted outages during deployments and data migrations. |
npx claudepluginhub intense-visions/harness-engineering --plugin harness-claudeAssesses and guides implementation of system resilience patterns: circuit breakers, retries, bulkheads, graceful degradation, health checks, timeouts. Prevents cascading failures in distributed services.
Assists implementing circuit breakers, retries, bulkheads, and resilience patterns for fault-tolerant distributed systems.
Provides patterns for managing external dependencies: circuit breakers, timeouts, retries with exponential backoff, bulkhead, graceful degradation, and dependency isolation. Use with any service or API call.