Covers rate limiting algorithms like token bucket, leaky bucket, fixed/sliding windows, and distributed patterns for API throttling and quotas.
Slash command: /systems-design:rate-limiting-patterns
Patterns for protecting APIs and services through rate limiting, throttling, and quota management.
Rate limiting protects against:
- DDoS attacks
- Brute force attempts
- Resource exhaustion
- Cost overruns (cloud APIs)
- Cascading failures
Business benefits:
- Fair resource allocation
- Predictable performance
- Cost control
- SLA enforcement
Concept (Token Bucket): tokens are added at a fixed rate; each request consumes tokens.
Configuration:
- Bucket size (max tokens): 100
- Refill rate: 10 tokens/second
Behavior:
┌─────────────────────────┐
│ Bucket (capacity: 100) │
│ ████████████░░░░░░░░░░ │ 60 tokens available
└─────────────────────────┘
     ↑                         ↓
10 tokens/s          Request takes 1 token
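Allows bursts up to the bucket size; after that, requests are limited to the refill rate.
Characteristics: allows bursts; low memory (one token count and one timestamp per client); good precision; suited to general API limiting.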
Implementation sketch:
token_bucket(cost):
  now = current_time()
  tokens = min(tokens + (now - last_update) * rate, capacity)
  last_update = now
  if tokens >= cost:
    tokens -= cost
    return ALLOW
  return DENY
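The pseudocode above, written as a minimal single-process Python sketch. Assumptions: `TokenBucket`, `allow`, and the optional `now` parameter are illustrative names rather than any library's API; timestamps are in seconds; the bucket starts full; and the class is not thread-safe.

```python
import time
from typing import Optional


class TokenBucket:
    """Single-process token bucket (illustrative sketch, not thread-safe)."""

    def __init__(self, capacity: float, rate: float, now: Optional[float] = None):
        self.capacity = capacity      # maximum tokens, i.e. the burst size
        self.rate = rate              # tokens added per second
        self.tokens = capacity        # start full so an initial burst is allowed
        self.last_update = time.monotonic() if now is None else now

    def allow(self, cost: float = 1, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.last_update)
        self.tokens = min(self.tokens + elapsed * self.rate, self.capacity)
        self.last_update = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True   # ALLOW
        return False      # DENY
```

With `capacity=100` and `rate=10`, a fresh bucket admits 100 requests at once and about 10 per second after that. Passing `now` explicitly makes the behavior deterministic in tests.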
Concept (Leaky Bucket): requests enter a queue and are processed at a fixed rate.
┌─────────────────────────┐
│ Queue (capacity: 100) │
│ ██████████████████████ │ Requests waiting
└──────────┬──────────────┘
│
▼ Process at fixed rate (10/sec)
[Processing]
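Smooths traffic to a constant output rate. Requests that arrive while the queue is full are rejected.
Characteristics: no bursts; low memory; good precision; suited to smooth rate enforcement. Queued requests pay extra latency while they wait.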
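A minimal sketch of the queue-based leaky bucket described above. Assumptions: the class and method names are hypothetical; the caller supplies timestamps in seconds; and the `processed` list stands in for handing drained requests to a worker.

```python
from collections import deque
from typing import Any, Deque, List


class LeakyBucket:
    """Bounded FIFO queue drained at a fixed rate (illustrative sketch)."""

    def __init__(self, capacity: int, leak_rate: float, now: float = 0.0):
        self.capacity = capacity          # max requests waiting in the queue
        self.leak_rate = leak_rate        # requests processed per second
        self.queue: Deque[Any] = deque()
        self.processed: List[Any] = []    # stands in for dispatching to a worker
        self.last_leak = now

    def _leak(self, now: float) -> None:
        # Whole requests that would have drained since the last leak.
        drained = int((now - self.last_leak) * self.leak_rate)
        for _ in range(min(drained, len(self.queue))):
            self.processed.append(self.queue.popleft())
        if drained > 0:
            # Advance by whole leak intervals so partial progress is kept.
            self.last_leak += drained / self.leak_rate
        if not self.queue:
            # An empty bucket does not bank capacity while idle.
            self.last_leak = now

    def offer(self, request: Any, now: float) -> bool:
        """Queue a request; return False (reject) if the queue is full."""
        self._leak(now)
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(request)
        return True
```

Unlike the token bucket, this sketch never releases requests faster than `leak_rate`. A burst therefore fills the queue instead of passing straight through.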
Concept (Fixed Window): count requests per fixed time window; the counter resets at each window boundary.
Window: 1 minute, Limit: 100 requests
|-------- Window 1 --------|-------- Window 2 --------|
95 requests ? requests
[Allow] [Reset to 0]
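Problem: boundary burst
End of window 1: 100 requests
Start of window 2: 100 requests
= 200 requests within about 1 second, twice the limit
Characteristics: boundary bursts possible; very low memory (one counter per client); poor precision; suited to simple limits.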
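A minimal in-memory fixed-window sketch. It assumes a single process and timestamps in seconds; `FixedWindowLimiter` is a hypothetical name. The test reproduces the boundary burst described above.

```python
from typing import Optional


class FixedWindowLimiter:
    """One counter per fixed window, reset at each boundary (sketch)."""

    def __init__(self, limit: int, window: float):
        self.limit = limit                       # max requests per window
        self.window = window                     # window length in seconds
        self.window_id: Optional[int] = None     # index of the window being counted
        self.count = 0

    def allow(self, now: float) -> bool:
        window_id = int(now // self.window)      # which fixed window `now` falls in
        if window_id != self.window_id:
            self.window_id = window_id           # new window: reset the counter
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False
```

With `limit=100` and `window=60`, 100 calls at t=59.5 and 100 more at t=60.5 are all admitted: 200 requests in one second.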
Concept (Sliding Window Log): store the timestamp of each request and count only those inside the trailing window.
Window: 1 minute, Limit: 100
Requests: [t-55s, t-50s, t-45s, ..., t-5s, t-2s, now]
Count all requests in [now - 60s, now]
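No boundary burst problem, but memory-intensive: one timestamp is stored per request in the window, up to the limit for every client.
Characteristics: no bursts; high memory; exact precision; suited to strict compliance.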
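A minimal sliding-log sketch using a deque of timestamps, under the same assumptions (one process, seconds, hypothetical names). Only accepted requests are logged. That is a design choice of this sketch: some implementations also log rejected requests, which keeps persistent offenders blocked.

```python
from collections import deque
from typing import Deque


class SlidingLogLimiter:
    """Keeps one timestamp per accepted request (illustrative sketch)."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.log: Deque[float] = deque()   # accepted timestamps, oldest first

    def allow(self, now: float) -> bool:
        # Evict timestamps that fall outside [now - window, now].
        while self.log and self.log[0] < now - self.window:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False
```

Repeating the fixed-window boundary test here, the second burst at t=60.5 is rejected, because the first 100 requests are still inside the trailing 60 seconds.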
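Concept (Sliding Window Counter): weight the previous window's count by the fraction of it that still overlaps the sliding window, then add the current window's count.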
Previous window: 80 requests
Current window: 30 requests (40% through window)
Weighted count = 80 * (1 - 0.4) + 30 = 48 + 30 = 78
Limit: 100
Result: ALLOW (78 < 100)
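Characteristics: minimal bursts (the estimate assumes the previous window's requests were evenly spread); low memory (two counters per client); good precision; the best general choice.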
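A minimal sketch of the sliding window counter. It assumes per-client state held in memory and timestamps in seconds; the class name is hypothetical. The test reproduces the 80/30 example above.

```python
class SlidingWindowCounter:
    """Approximate sliding window from two fixed-window counters (sketch)."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.current_id = 0          # index of the current fixed window
        self.current_count = 0
        self.previous_count = 0

    def _roll(self, now: float) -> None:
        window_id = int(now // self.window)
        if window_id == self.current_id:
            return
        # The old current window becomes "previous" only if it is adjacent.
        adjacent = window_id == self.current_id + 1
        self.previous_count = self.current_count if adjacent else 0
        self.current_count = 0
        self.current_id = window_id

    def estimate(self, now: float) -> float:
        self._roll(now)
        elapsed = (now % self.window) / self.window   # fraction of current window used
        return self.previous_count * (1 - elapsed) + self.current_count

    def allow(self, now: float) -> bool:
        if self.estimate(now) < self.limit:
            self.current_count += 1
            return True
        return False
```

80 requests at t=10 followed by 30 at t=84 (40% into the second 60-second window) gives an estimate of 78. After that, only 22 more requests fit under the limit of 100.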
| Algorithm | Burst Handling | Memory | Precision | Use Case |
|---|---|---|---|---|
| Token Bucket | Allows bursts | Low | Good | General API limiting |
| Leaky Bucket | No bursts | Low | Good | Smooth rate enforcement |
| Fixed Window | Boundary burst | Very Low | Poor | Simple limits |
| Sliding Log | No bursts | High | Exact | Strict compliance |
| Sliding Counter | Minimal burst | Low | Good | Best general choice |
Single node: a simple in-memory counter works.
Multiple nodes: counters need coordination.
Without coordination, each node enforces the limit on its own:
Node 1: 50 requests (under 100 limit)
Node 2: 50 requests (under 100 limit)
Node 3: 50 requests (under 100 limit)
Total: 150 requests (over 100 limit!)
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Node 1 │ │ Node 2 │ │ Node 3 │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
└───────────────┼───────────────┘
│
┌──────▼──────┐
│ Redis │
│ (counters) │
└─────────────┘
Pros (centralized store): accurate, consistent across nodes
Cons: Redis dependency, an extra network round trip per request, single point of failure
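Local limits with periodic sync: each node enforces a fraction of the global limit:
- 3 nodes, 100 limit → 33 per node
Nodes periodically sync usage to rebalance unused capacity (for example, shifting allowance from idle nodes to busy ones).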
Pros: Low latency, resilient
Cons: Less precise, sync complexity
Sticky routing: send each client to the same node (by IP, API key, etc.), so a single node holds that client's counter.
Pros: Simple, no coordination needed
Cons: Uneven load, failover complexity
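Sticky routing needs a deterministic mapping from client identifier to node. Below is a minimal sketch that uses a stable hash modulo the node count; `route` and the node names are hypothetical. Python's built-in `hash()` of a string is randomized per process, so the sketch uses SHA-256 instead.

```python
import hashlib
from typing import Sequence


def route(client_key: str, nodes: Sequence[str]) -> str:
    """Pick a node deterministically from a client identifier (sketch)."""
    digest = hashlib.sha256(client_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]
```

Plain modulo remaps most clients whenever the node list changes, which is part of the failover complexity noted above. Consistent hashing is a common alternative that moves fewer clients.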
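Token Bucket with Redis (a Lua script loaded once with SCRIPT LOAD and invoked by its SHA1, so the whole update runs atomically):
EVALSHA {token_bucket_script_sha1} 1 {key}
{capacity} {refill_rate} {tokens_requested}
Script:
1. Read the current token count and last-refill timestamp
2. Add the tokens earned since then (current time from the caller or Redis TIME), capped at capacity
3. If enough tokens remain, decrement and allow; otherwise deny
4. Store the new count and timestamp, and return whether the request was allowed plus the tokens remaining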
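A sketch of the script steps above as Lua embedded in Python. Assumptions: a redis-py style client exposing `eval(script, numkeys, *keys_and_args)`; Redis 4.0+ for multi-field HSET; and the caller passing its own clock as a fourth argument, which exposes the limiter to the clock skew listed under test scenarios. `TOKEN_BUCKET_LUA` and `acquire` are hypothetical names. redis-py's `register_script` wraps the same script in EVALSHA with a fallback to EVAL.

```python
import time
from typing import Any, Optional, Tuple

# Redis runs a Lua script atomically, so read-refill-decrement-write
# cannot interleave with other clients' requests for the same key.
TOKEN_BUCKET_LUA = """
local capacity = tonumber(ARGV[1])
local rate = tonumber(ARGV[2])
local cost = tonumber(ARGV[3])
local now = tonumber(ARGV[4])

local state = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens = tonumber(state[1]) or capacity
local ts = tonumber(state[2]) or now

tokens = math.min(capacity, tokens + math.max(0, now - ts) * rate)
local allowed = 0
if tokens >= cost then
  tokens = tokens - cost
  allowed = 1
end

redis.call('HSET', KEYS[1], 'tokens', tostring(tokens), 'ts', tostring(now))
-- After this much idle time the bucket would be full, so expiry loses nothing.
redis.call('EXPIRE', KEYS[1], math.ceil(capacity / rate) + 1)
-- Return tokens as a string: Redis truncates Lua numbers to integers.
return {allowed, tostring(tokens)}
"""


def acquire(client: Any, key: str, capacity: float, rate: float,
            cost: float = 1, now: Optional[float] = None) -> Tuple[bool, float]:
    """Run the script once; returns (allowed, tokens_remaining)."""
    now = time.time() if now is None else now
    allowed, remaining = client.eval(TOKEN_BUCKET_LUA, 1, key,
                                     capacity, rate, cost, now)
    return bool(allowed), float(remaining)   # float() accepts str or bytes
```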
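Common headers for communicating limits to clients (the X-RateLimit-* names are a widespread convention rather than a formal standard; Retry-After is standard HTTP):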
HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 45
X-RateLimit-Reset: 1640000000
Retry-After: 30 (when rate limited)
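Or the IETF draft RateLimit header fields, where RateLimit-Reset is the number of seconds until the window resets rather than a Unix timestamp like X-RateLimit-Reset: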
RateLimit-Limit: 100
RateLimit-Remaining: 45
RateLimit-Reset: 30
Response when the limit is exceeded:
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 30
{
"error": {
"code": "RATE_LIMITED",
"message": "Rate limit exceeded",
"retry_after": 30,
"limit": 100,
"window": "1m"
}
}
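A framework-agnostic sketch that emits both header styles and the 429 body shown above. The function names are hypothetical, and sending both header families at once is a choice of this sketch, not a requirement. X-RateLimit-Reset is rendered as a Unix timestamp and RateLimit-Reset as seconds remaining.

```python
import json
import math
from typing import Dict, Tuple


def rate_limit_headers(limit: int, remaining: int, reset_epoch: float,
                       now: float) -> Dict[str, str]:
    """Build both header styles from the limiter's current state."""
    seconds_left = max(0, math.ceil(reset_epoch - now))
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(int(reset_epoch)),   # Unix timestamp
        "RateLimit-Limit": str(limit),
        "RateLimit-Remaining": str(max(0, remaining)),
        "RateLimit-Reset": str(seconds_left),         # seconds until reset
    }


def too_many_requests(limit: int, window: str, reset_epoch: float,
                      now: float) -> Tuple[int, Dict[str, str], str]:
    """Return (status, headers, JSON body) for a rate-limited request."""
    retry_after = max(1, math.ceil(reset_epoch - now))
    headers = rate_limit_headers(limit, 0, reset_epoch, now)
    headers["Retry-After"] = str(retry_after)
    headers["Content-Type"] = "application/json"
    body = json.dumps({"error": {
        "code": "RATE_LIMITED",
        "message": "Rate limit exceeded",
        "retry_after": retry_after,
        "limit": limit,
        "window": window,
    }})
    return 429, headers, body
```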
Apply limits at multiple levels; a request must pass every level that applies to it:
Level 1: Global (protect infrastructure)
- 10,000 req/sec across all clients
Level 2: Per-tenant (fair allocation)
- 1,000 req/min per organization
Level 3: Per-user (prevent abuse)
- 100 req/min per user
Level 4: Per-endpoint (protect expensive operations)
- 10 req/min for /export endpoint
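A minimal in-memory sketch of the tiers listed above, using fixed-window counters per level. Assumptions: the names are hypothetical; the /export limit is counted per user (the list does not say); and old windows are never evicted, whereas a real store would expire them. All levels are checked before any counter is incremented, so a request rejected at one level does not consume allowance at the others.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, Iterator, List, Tuple

# (level, limit, window in seconds), taken from the tiers listed above.
TIERS: List[Tuple[str, int, float]] = [
    ("global", 10_000, 1),   # protect infrastructure
    ("tenant", 1_000, 60),   # fair allocation per organization
    ("user", 100, 60),       # prevent abuse per user
]
# Per-endpoint limits; counting them per user is an assumption of this sketch.
ENDPOINT_LIMITS: Dict[str, Tuple[int, float]] = {"/export": (10, 60)}


class TieredLimiter:
    """Fixed-window counters at several levels (illustrative sketch)."""

    def __init__(self) -> None:
        self.counts: DefaultDict[Tuple[str, int], int] = defaultdict(int)

    def _levels(self, tenant: str, user: str, endpoint: str,
                now: float) -> Iterator[Tuple[Tuple[str, int], int]]:
        ids = {"global": "all", "tenant": tenant, "user": user}
        for level, limit, window in TIERS:
            yield (f"{level}:{ids[level]}", int(now // window)), limit
        if endpoint in ENDPOINT_LIMITS:
            limit, window = ENDPOINT_LIMITS[endpoint]
            yield (f"endpoint:{endpoint}:{user}", int(now // window)), limit

    def allow(self, tenant: str, user: str, endpoint: str, now: float) -> bool:
        checks = list(self._levels(tenant, user, endpoint, now))
        if any(self.counts[key] >= limit for key, limit in checks):
            return False
        for key, _ in checks:
            self.counts[key] += 1
        return True
```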
Rate Limit: Requests per time window (burst protection)
- 100 requests/minute
Quota: Total allocation over period (budget)
- 10,000 API calls/month
Track usage:
- Per API key
- Per endpoint
- Per operation type
Alert thresholds:
- 80% usage: Warning notification
- 100% usage: Hard block or overage charges
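A minimal sketch of quota tracking with the alert thresholds above. Assumptions: usage is tracked per API key and per (key, endpoint) pair in one process; per-operation-type tracking would be one more counter of the same kind; `QuotaTracker` and its status strings are hypothetical. Resetting usage at the end of each period is left out.

```python
from collections import Counter
from typing import Counter as CounterT, Tuple


class QuotaTracker:
    """Per-period quota per API key with warning and hard-block thresholds."""

    def __init__(self, quota: int, warn_at: float = 0.8):
        self.quota = quota                               # e.g. 10,000 calls/month
        self.warn_at = warn_at                           # fraction that triggers a warning
        self.by_key: CounterT[str] = Counter()           # api_key -> calls
        self.by_endpoint: CounterT[Tuple[str, str]] = Counter()  # (key, endpoint) -> calls

    def record(self, api_key: str, endpoint: str) -> str:
        """Return 'blocked', 'warning', or 'ok' for one incoming call."""
        if self.by_key[api_key] >= self.quota:
            return "blocked"        # 100%: hard block (or bill as overage)
        self.by_key[api_key] += 1
        self.by_endpoint[(api_key, endpoint)] += 1
        if self.by_key[api_key] >= self.warn_at * self.quota:
            return "warning"        # 80%: send a warning notification
        return "ok"
```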
Instead of a hard block, degrade gracefully:
1. Reduce quality (lower resolution, fewer results)
2. Queue requests (process later)
3. Serve cached responses
4. Allow the burst, with a penalty (slower recovery afterward)
Clients should implement exponential backoff:
1. Receive 429
2. Wait for the Retry-After duration (or 1s if it is absent)
3. Retry
4. If 429 again, wait 2s
5. Continue doubling up to a maximum (e.g., 60s)
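A minimal client-side sketch of the steps above. Assumptions: `send` is a hypothetical callable that returns `(status, retry_after_seconds_or_None)`; `sleep` is injectable so the logic can be tested without waiting; and `max_attempts` is an illustrative cap the steps do not specify. Retry-After is honored as given, while the sketch's own delay doubles from 1s up to 60s.

```python
import time
from typing import Callable, Optional, Tuple

# (HTTP status, Retry-After in seconds or None)
Response = Tuple[int, Optional[float]]


def call_with_backoff(send: Callable[[], Response],
                      sleep: Callable[[float], None] = time.sleep,
                      base: float = 1.0, max_delay: float = 60.0,
                      max_attempts: int = 8) -> Response:
    """Retry on 429: honor Retry-After when given, else wait `base` seconds
    and double the wait after each further 429, capped at `max_delay`."""
    delay = min(base, max_delay)
    attempt = 1
    while True:
        status, retry_after = send()
        if status != 429 or attempt >= max_attempts:
            return status, retry_after   # success, other error, or give up
        sleep(retry_after if retry_after is not None else delay)
        delay = min(delay * 2, max_delay)
        attempt += 1
```

Three 429 responses without Retry-After followed by a 200 produce waits of 1s, 2s, and 4s.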
Test scenarios:
- Burst traffic
- Sustained high traffic
- Clock skew (distributed systems)
- Recovery after limit
- Multiple client types
Related skills:
- api-design-fundamentals - API design patterns
- idempotency-patterns - Safe retries
- quality-attributes-taxonomy - Performance attributes