rate-limiting-patterns | systems-design

Rate Limiting Patterns

Patterns for protecting APIs and services through rate limiting, throttling, and quota management.

When to Use This Skill

Implementing API rate limiting
Choosing rate limiting algorithms
Designing distributed rate limiting
Setting up quota management
Protecting against abuse

Why Rate Limiting

Protection against:
- DDoS attacks
- Brute force attempts
- Resource exhaustion
- Cost overruns (cloud APIs)
- Cascading failures

Business benefits:
- Fair resource allocation
- Predictable performance
- Cost control
- SLA enforcement

Rate Limiting Algorithms

Token Bucket

Concept: Tokens are added to a bucket at a fixed rate; each request consumes one or more tokens and is rejected when too few remain

Configuration:
- Bucket size (max tokens): 100
- Refill rate: 10 tokens/second

Behavior:
┌─────────────────────────┐
│ Bucket (capacity: 100)  │
│ ████████████░░░░░░░░░░  │ 60 tokens available
└─────────────────────────┘
        ↑           ↓
   10 tokens/s   Request takes 1 token

Allows bursts up to the bucket size; once the bucket is empty, throughput is limited to the refill rate.

Characteristics:

Allows controlled bursts
Simple to implement
Memory efficient
Widely used for API rate limiting

Implementation sketch:

token_bucket(cost):
  now = current_time()
  tokens = min(tokens + (now - last_update) * rate, capacity)
  last_update = now
  if tokens >= cost:
    tokens -= cost
    return ALLOW
  return DENY
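The sketch above can be made runnable roughly as follows. This is a minimal single-process version, not a definitive implementation; the class name `TokenBucket` and the injectable `clock` parameter are assumptions added for illustration and testing.

```python
import time


class TokenBucket:
    """Token bucket: refills continuously, each request spends `cost` tokens."""

    def __init__(self, capacity, rate, clock=time.monotonic):
        self.capacity = capacity       # maximum burst size, in tokens
        self.rate = rate               # refill rate, in tokens per second
        self.clock = clock             # injectable so tests can control time
        self.tokens = float(capacity)  # start with a full bucket
        self.last_update = clock()

    def allow(self, cost=1):
        now = self.clock()
        # Credit the tokens earned since the last call, capped at capacity.
        elapsed = now - self.last_update
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_update = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With capacity 100 and rate 10, a client can send 100 requests at once and is then held to about 10 requests per second. The class is not thread-safe; a lock around `allow` would be needed for concurrent callers.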

Leaky Bucket

Concept: Requests enter a bounded queue and are processed at a fixed rate; requests that arrive when the queue is full are rejected

┌─────────────────────────┐
│ Queue (capacity: 100)   │
│ ██████████████████████  │ Requests waiting
└──────────┬──────────────┘
           │
           ▼ Process at fixed rate (10/sec)
       [Processing]

Smooths traffic to constant rate.

Characteristics:

Smooth output rate
No bursts reach the downstream service
Requests may queue, which adds latency
Good for downstream protection
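One way to sketch the queue-and-drain behavior described above, assuming a single process that calls `drain()` periodically from a worker loop. The names `LeakyBucket`, `offer`, and `drain` are hypothetical.

```python
import time
from collections import deque


class LeakyBucket:
    """Bounded queue that releases requests at a fixed rate."""

    def __init__(self, capacity, rate, clock=time.monotonic):
        self.capacity = capacity   # maximum number of queued requests
        self.rate = rate           # requests released per second
        self.clock = clock         # injectable so tests can control time
        self.queue = deque()
        self.last_leak = clock()

    def offer(self, request):
        """Enqueue a request; return False if the queue is full."""
        if len(self.queue) >= self.capacity:
            return False
        if not self.queue:
            # Idle time must not accumulate credit, or a later burst
            # would be released all at once.
            self.last_leak = self.clock()
        self.queue.append(request)
        return True

    def drain(self):
        """Return the requests that are due for processing right now."""
        due = int((self.clock() - self.last_leak) * self.rate)
        if due <= 0:
            return []
        # Advance by whole release slots only, keeping the fractional rest.
        self.last_leak += due / self.rate
        count = min(due, len(self.queue))
        return [self.queue.popleft() for _ in range(count)]
```

Unlike the token bucket, a burst of arrivals is never forwarded as a burst: it waits in the queue and leaves at the configured rate.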

Fixed Window

Concept: Count requests in fixed time windows

Window: 1 minute, Limit: 100 requests

|-------- Window 1 --------|-------- Window 2 --------|
   95 requests                  ? requests
   [Allow]                      [Reset to 0]

Problem: Boundary burst
Last moments of window 1: 100 requests
First moments of window 2: 100 requests
= 200 requests within about 1 second, twice the limit

Characteristics:

Simple to implement
Memory efficient
Boundary burst problem
Good for simple use cases
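A minimal in-memory sketch of a fixed window counter, assuming one limiter instance per client. The class name and the `clock` parameter are illustrative, not taken from any library.

```python
import time


class FixedWindowLimiter:
    """Counts requests per fixed window; the count resets at each boundary."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock            # injectable so tests can control time
        self.current_window = None    # index of the window being counted
        self.count = 0

    def allow(self):
        window_id = int(self.clock() // self.window)
        if window_id != self.current_window:
            # A new window has started: reset the counter.
            self.current_window = window_id
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False
```

Because the counter resets at the boundary, 100 requests at second 59.9 and another 100 at second 60.1 are all accepted, which is the boundary burst described above.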

Sliding Window Log

Concept: Track timestamp of each request

Window: 1 minute, Limit: 100

Requests: [t-55s, t-50s, t-45s, ..., t-5s, t-2s, now]
Count all requests in [now - 60s, now]

No boundary burst problem, but memory intensive.

Characteristics:

Precise limiting
No boundary issues
Memory intensive (stores all timestamps)
Good for strict limits
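The log approach can be sketched with a deque of timestamps. This is an illustrative single-client version; `SlidingWindowLog` and its parameters are assumed names.

```python
import time
from collections import deque


class SlidingWindowLog:
    """Keeps one timestamp per accepted request and counts those in the window."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock          # injectable so tests can control time
        self.timestamps = deque()   # oldest accepted request first

    def allow(self):
        now = self.clock()
        cutoff = now - self.window
        # Evict entries older than the window [now - window, now].
        while self.timestamps and self.timestamps[0] < cutoff:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False
```

Memory grows with the limit (up to `limit` timestamps per client), which is the cost of exact counting.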

Sliding Window Counter

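Concept: Estimate the sliding count as the current window's count plus the previous window's count, weighted by how much of the previous window still overlaps the sliding window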

Previous window: 80 requests
Current window: 30 requests (40% through window)

Weighted count = 80 * (1 - 0.4) + 30 = 48 + 30 = 78
Limit: 100
Result: ALLOW (78 < 100)

Characteristics:

Approximation (usually good enough)
Memory efficient
Smooths boundary issues
Best balance for most cases
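The weighted estimate above could be implemented along these lines. This is a sketch for a single client in one process; the class name, the `estimate` helper, and the `clock` parameter are assumptions.

```python
import time


class SlidingWindowCounter:
    """Approximates a sliding window using two fixed-window counters."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock       # injectable so tests can control time
        self.window_id = None    # index of the current fixed window
        self.current = 0
        self.previous = 0

    def _roll(self, now):
        window_id = int(now // self.window)
        if self.window_id is None:
            self.window_id = window_id
        elif window_id != self.window_id:
            # Only the immediately preceding window contributes.
            adjacent = window_id == self.window_id + 1
            self.previous = self.current if adjacent else 0
            self.current = 0
            self.window_id = window_id

    def estimate(self, now=None):
        now = self.clock() if now is None else now
        self._roll(now)
        elapsed_fraction = (now % self.window) / self.window
        return self.previous * (1 - elapsed_fraction) + self.current

    def allow(self):
        if self.estimate(self.clock()) < self.limit:
            self.current += 1
            return True
        return False
```

With 80 requests in the previous window and 30 in the current one at 40% elapsed, `estimate()` gives 78, matching the worked example. Only two counters are stored per client, regardless of the limit.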

Algorithm Selection Guide

Algorithm	Burst Handling	Memory	Precision	Use Case
Token Bucket	Allows bursts	Low	Good	General API limiting
Leaky Bucket	No bursts	Low	Good	Smooth rate enforcement
Fixed Window	Boundary burst	Very Low	Poor	Simple limits
Sliding Window Log	No boundary burst	High	Exact	Strict compliance
Sliding Window Counter	Minimal boundary burst	Low	Good	Best general choice

Distributed Rate Limiting

Challenge

Single node: Simple in-memory counter
Multiple nodes: Need coordination

Without coordination:
Node 1: 50 requests (under 100 limit)
Node 2: 50 requests (under 100 limit)
Node 3: 50 requests (under 100 limit)
Total: 150 requests (over 100 limit!)

Pattern 1: Centralized (Redis)

┌─────────┐     ┌─────────┐     ┌─────────┐
│ Node 1  │     │ Node 2  │     │ Node 3  │
└────┬────┘     └────┬────┘     └────┬────┘
     │               │               │
     └───────────────┼───────────────┘
                     │
              ┌──────▼──────┐
              │    Redis    │
              │ (counters)  │
              └─────────────┘

Pros: Accurate, consistent
Cons: Redis dependency, an extra network round trip per request, single point of failure unless Redis is replicated

Pattern 2: Local + Sync

Each node gets fraction of limit:
- 3 nodes, 100 limit → 33 per node

Periodically sync to rebalance unused capacity.

Pros: Low latency, resilient
Cons: Less precise, sync complexity

Pattern 3: Sticky Sessions

Route same client to same node (by IP, API key, etc.)

Pros: Simple, no coordination needed
Cons: Uneven load, failover complexity
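As an illustration of routing the same client to the same node, a load balancer could hash a stable client identifier. The function name `pick_node` and the node names are hypothetical.

```python
import hashlib


def pick_node(client_id, nodes):
    """Map a client identifier (IP, API key, ...) to one node deterministically."""
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer and wrap around the node list.
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


nodes = ["node-1", "node-2", "node-3"]
target = pick_node("api-key-123", nodes)  # same key always yields the same node
```

Plain modulo hashing remaps most clients whenever the node list changes, which is part of the failover complexity noted above; consistent hashing reduces that churn.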

Redis Implementation

Token Bucket with Redis:

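EVALSHA {script_sha1} 1 {key}
  {capacity} {refill_rate} {tokens_requested}

The script runs atomically on the Redis server:
1. Get current tokens and last-update timestamp
2. Calculate tokens to add since the last request (capped at capacity)
3. If enough tokens, decrement and allow; otherwise deny
4. Store the new token count and timestamp
5. Return the decision and tokens remaining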
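The script steps above might look roughly like the following Lua, shown here as a Python string together with a thin wrapper. This is a sketch under stated assumptions: the hash field names `tokens` and `ts`, the key layout, the expiry policy, and the choice to pass `now` from the caller (rather than reading the Redis server clock) are all illustrative. The wrapper expects a client object exposing `eval(script, numkeys, *keys_and_args)`, as redis-py's `Redis` client does.

```python
import time

# Lua executed atomically by Redis. KEYS[1] is the bucket key;
# ARGV = capacity, refill_rate (tokens/sec), cost, now (seconds).
TOKEN_BUCKET_LUA = """
local capacity = tonumber(ARGV[1])
local rate = tonumber(ARGV[2])
local cost = tonumber(ARGV[3])
local now = tonumber(ARGV[4])

local state = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens = tonumber(state[1]) or capacity
local ts = tonumber(state[2]) or now

tokens = math.min(capacity, tokens + math.max(0, now - ts) * rate)
local allowed = 0
if tokens >= cost then
  tokens = tokens - cost
  allowed = 1
end

redis.call('HSET', KEYS[1], 'tokens', tostring(tokens), 'ts', tostring(now))
redis.call('EXPIRE', KEYS[1], math.ceil(capacity / rate) * 2)
return {allowed, math.floor(tokens)}
"""


def allow_request(client, key, capacity, refill_rate, cost=1, now=None):
    """Run the script and return (allowed, whole tokens remaining)."""
    now = time.time() if now is None else now
    allowed, remaining = client.eval(
        TOKEN_BUCKET_LUA, 1, key, capacity, refill_rate, cost, now
    )
    return bool(allowed), int(remaining)
```

In production the script would typically be loaded once and invoked by its SHA1 via EVALSHA to avoid resending the source on every call. Passing `now` from application nodes makes the limiter sensitive to clock skew between them.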

Rate Limit Headers

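Common headers to communicate limits to clients (a de facto convention, not a formal standard):

HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 45
X-RateLimit-Reset: 1640000000  (commonly a Unix timestamp for when the window resets)

Retry-After: 30 is added when the request is rate limited (seconds to wait before retrying).

Or the IETF draft field names, where Reset is the number of seconds until the window resets:
RateLimit-Limit: 100
RateLimit-Remaining: 45
RateLimit-Reset: 30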
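A small helper can build either header set from the limiter's state. This is a sketch; the function name and parameters are assumptions, and it treats `X-RateLimit-Reset` as a Unix timestamp and `RateLimit-Reset` as seconds remaining, following the examples above.

```python
import math


def rate_limit_headers(limit, remaining, reset_epoch, now,
                       draft_style=False, limited=False):
    """Return response headers describing the caller's rate limit state."""
    seconds_left = max(0, math.ceil(reset_epoch - now))
    remaining = max(0, remaining)
    if draft_style:
        headers = {
            "RateLimit-Limit": str(limit),
            "RateLimit-Remaining": str(remaining),
            "RateLimit-Reset": str(seconds_left),       # seconds until reset
        }
    else:
        headers = {
            "X-RateLimit-Limit": str(limit),
            "X-RateLimit-Remaining": str(remaining),
            "X-RateLimit-Reset": str(int(reset_epoch)),  # Unix timestamp
        }
    if limited:
        # Only sent when the request was actually rejected.
        headers["Retry-After"] = str(seconds_left)
    return headers
```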

Rate Limit Response

HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 30

{
  "error": {
    "code": "RATE_LIMITED",
    "message": "Rate limit exceeded",
    "retry_after": 30,
    "limit": 100,
    "window": "1m"
  }
}
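The response above could be produced by a framework-agnostic helper such as this one. The function name and the `(status, headers, body)` return shape are assumptions for illustration.

```python
import json


def rate_limited_response(limit, window, retry_after):
    """Build the status, headers, and JSON body for a 429 response."""
    body = {
        "error": {
            "code": "RATE_LIMITED",
            "message": "Rate limit exceeded",
            "retry_after": retry_after,   # seconds, mirrors the header
            "limit": limit,
            "window": window,
        }
    }
    headers = {
        "Content-Type": "application/json",
        "Retry-After": str(retry_after),
    }
    return 429, headers, json.dumps(body)


status, headers, body = rate_limited_response(limit=100, window="1m", retry_after=30)
```

Keeping `retry_after` in both the header and the body lets clients that only parse JSON still back off correctly.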

Multi-Tier Rate Limiting

Apply limits at multiple levels:

Level 1: Global (protect infrastructure)
  - 10,000 req/sec across all clients

Level 2: Per-tenant (fair allocation)
  - 1,000 req/min per organization

Level 3: Per-user (prevent abuse)
  - 100 req/min per user

Level 4: Per-endpoint (protect expensive operations)
  - 10 req/min for /export endpoint
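One possible way to combine the tiers is to check every applicable tier first and record the request only if all of them have room, so a rejection at one level does not consume budget at another. The `Counter` class, `check_tiers`, and the request dictionary shape are hypothetical, and window resets are omitted to keep the sketch short.

```python
class Counter:
    """Hypothetical per-key counter; window resets are omitted for brevity."""

    def __init__(self, limit):
        self.limit = limit
        self.counts = {}

    def has_room(self, key):
        return self.counts.get(key, 0) < self.limit

    def record(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1


def check_tiers(tiers, request):
    """Return (allowed, name of the blocking tier or None)."""
    active = []
    for name, counter, key_fn in tiers:
        key = key_fn(request)
        if key is not None:           # None means the tier does not apply
            active.append((name, counter, key))
    for name, counter, key in active:
        if not counter.has_room(key):
            return False, name
    for _, counter, key in active:    # all tiers passed: commit the request
        counter.record(key)
    return True, None


# Limits taken from the levels above; each tier would use its own window.
TIERS = [
    ("global", Counter(10_000), lambda r: "all"),
    ("tenant", Counter(1_000), lambda r: r["org"]),
    ("user", Counter(100), lambda r: r["user"]),
    ("endpoint", Counter(10), lambda r: "/export" if r["path"] == "/export" else None),
]
```

Returning the name of the blocking tier makes it possible to report which limit was hit in the 429 response.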

Quota Management

Quota vs Rate Limit

Rate Limit: Requests per time window (burst protection)
  - 100 requests/minute

Quota: Total allocation over period (budget)
  - 10,000 API calls/month

Quota Tracking

Track usage:
- Per API key
- Per endpoint
- Per operation type

Alert thresholds:
- 80% usage: Warning notification
- 100% usage: Hard block or overage charges
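A sketch of per-key quota tracking with the two thresholds above. The class name, the status strings, and the in-memory storage are assumptions; a real system would persist usage and reset it each billing period.

```python
class QuotaTracker:
    """Tracks usage per API key against a fixed quota for the period."""

    def __init__(self, quota, warn_ratio=0.8, hard_block=True):
        self.quota = quota
        self.warn_ratio = warn_ratio
        self.hard_block = hard_block   # False means bill overage instead
        self.used = {}                 # api_key -> calls used this period
        self.warned = set()            # keys that already got the warning

    def record(self, api_key, calls=1):
        """Return 'ok', 'warning', 'blocked', or 'overage'."""
        used = self.used.get(api_key, 0)
        if used + calls > self.quota:
            if self.hard_block:
                return "blocked"       # usage is not recorded
            self.used[api_key] = used + calls
            return "overage"
        used += calls
        self.used[api_key] = used
        if used >= self.warn_ratio * self.quota and api_key not in self.warned:
            self.warned.add(api_key)   # notify only once per period
            return "warning"
        return "ok"
```

Tracking per endpoint or per operation type would follow the same pattern with a composite key such as `(api_key, endpoint)`.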

Best Practices

Graceful Degradation

Instead of hard block:
1. Reduce quality (lower resolution, fewer results)
2. Queue requests (process later)
3. Serve cached responses
4. Allow burst with penalty (slower recovery)

Client-Side Handling

Implement exponential backoff:
1. Receive 429
2. Wait for the Retry-After interval (or 1s if the header is absent)
3. Retry
4. If 429 again, wait 2s
5. Continue doubling up to max (e.g., 60s)
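The steps above can be sketched as a retry loop. The `send` callable, which is assumed to return `(status, headers, body)`, and the other parameter names are hypothetical; only integer-second `Retry-After` values are handled here, although the header may also carry an HTTP date.

```python
import time


def call_with_backoff(send, max_attempts=6, base_delay=1.0,
                      max_delay=60.0, sleep=time.sleep):
    """Call send() and retry on 429 with exponentially growing waits."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status != 429 or attempt == max_attempts:
            return status, headers, body
        wait = delay
        retry_after = headers.get("Retry-After", "")
        if retry_after.isdigit():
            # Never retry sooner than the server asked for.
            wait = max(wait, float(retry_after))
        sleep(min(wait, max_delay))
        delay = min(delay * 2, max_delay)   # 1s, 2s, 4s, ... up to max
```

Adding a small random jitter to each wait is a common refinement so that many clients limited at the same moment do not all retry in lockstep.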

Testing Rate Limits

Test scenarios:
- Burst traffic
- Sustained high traffic
- Clock skew (distributed systems)
- Recovery after limit
- Multiple client types

Related Skills

api-design-fundamentals - API design patterns
idempotency-patterns - Safe retries
quality-attributes-taxonomy - Performance attributes