Skill

sysdesign-rate-limiting-strategies

Use when choosing a rate-limiting algorithm — picks token bucket, leaky bucket, fixed window, or sliding window and places it stateful vs stateless.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/matilha-sysdesign-pack:sysdesign-rate-limiting-strategies

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Use when the team is protecting an API from abuse, runaway clients, or cost

SKILL.md

188 lines · ~2.4k tokens

Stats

Stars: 0

Maintenance: Excellent

Last Commit: May 5, 2026

When this fires

Use when the team is protecting an API from abuse, runaway clients, or cost blowouts and has to pick a specific rate-limiting algorithm plus a place to store counters. Fires when someone says "just rate limit it" without naming burst tolerance, stateful storage cost, or per-user vs per-endpoint scope. The skill narrows the choice to a single algorithm, names the counter store, and flags the "don't punish good users" bias that Zhiyong Tan emphasises throughout the rate-limiting chapter.

Preconditions

The service already has observable traffic — raw QPS or a representative spike profile. Rate-limit tuning without traffic data is cargo culting.
Someone can answer "what does legit burst traffic look like for our users?" even roughly. Without that, every algorithm will over- or under-limit.
The team has agreed on the scope of the limit (per user, per API key, per endpoint, per IP, or a combination). These scopes compose badly if chosen after the fact.
There is a fallback behaviour defined — 429 with Retry-After, a soft queue, or shadow-mode logging. Rate limiting that silently drops requests is a debugging nightmare.

Execution Workflow

Pull the traffic shape. Use Read or the observability tool of record to fetch P50 and P99 QPS per user (or per key) over a representative window. Burst ratio (P99 over P50) drives algorithm choice more than absolute volume.
Pick the algorithm against the burst profile. Token bucket absorbs bursts up to the bucket size while enforcing a long-run average — the default when real users are spiky (human clickers, mobile retries). Leaky bucket enforces a strict steady rate — the default when a downstream dependency cannot handle spikes (legacy DB, third-party API with its own quota). Fixed window is the simplest counter but has a 2x burst failure mode at window boundaries — fine for coarse limits, dangerous for tight ones. Sliding window log or sliding window counter smooths that boundary problem at the cost of more storage per user.
Place the counter. Stateful means a shared store (Redis, DynamoDB) holds counters keyed by user or endpoint: a single source of truth that stays accurate across replicas, but it adds a network hop and a hot-key failure mode. Stateless means each sidecar or LB holds its own local counter: fast, with no extra hop, but the effective global limit is roughly the per-instance limit multiplied by the number of instances, so it drifts as replicas scale. Use stateful when limits are tight and fairness matters (paid tiers, anti-abuse). Use stateless when the limit is loose and per-instance is "close enough" (DDoS shield).
Name the scope grid. Typical grids are per-user-per-endpoint (billing limits), per-IP-global (anti-DDoS), and per-API-key-per-endpoint (B2B plans). Write the grid down before coding — scopes added later require schema changes in the counter store.
Define the 429 contract. Include Retry-After in seconds, a rate limit policy header if the API is external, and a correlation ID for support debugging. Undocumented 429 responses break client retry logic.
Shadow-mode before enforcement. Log "would have blocked" for a full week against production traffic. If the count of would-have-blocks against paying customers is non-trivial, the limits are wrong, not the users.
Wire dashboards for block-rate per scope plus a four-golden-signals panel (see sysdesign-monitoring-4-golden-signals). A rate limiter without a dashboard is a production outage waiting for a customer email.
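
The token bucket from step 2 can be sketched in a few lines. This is a minimal in-process illustration, not a production limiter: the class name, the explicit `now` argument (used instead of a wall clock so the behaviour is reproducible), and the numbers are all assumptions made for the example.

```python
class TokenBucket:
    """Token bucket: absorbs bursts up to `capacity`, refills at a steady rate."""

    def __init__(self, capacity: float, refill_per_s: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_s = refill_per_s
        self.tokens = capacity  # start full so the first burst is absorbed
        self.last = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill for the elapsed time, capped at the bucket size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_s)
        self.last = max(self.last, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Hypothetical numbers: a burst of 5 is tolerated, the long-run rate is 1 req/s.
bucket = TokenBucket(capacity=5, refill_per_s=1.0)
burst = [bucket.allow(now=0.0) for _ in range(6)]  # sixth request exceeds the bucket
later = [bucket.allow(now=2.0) for _ in range(3)]  # two tokens have refilled by t=2
```

For a stateful placement the same arithmetic would run against the shared store, typically as one atomic operation per request, which this sketch does not attempt to show.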

Rules: Do

Favour token bucket as the default for user-facing APIs. Real human traffic is bursty and leaky bucket will frustrate normal users on legitimate retries.
Store counters in a shared store (Redis with per-key expiry) when the limit enforces billing or fairness. Replica-local counters drift and cost the business real money.
Pair every limit with an explicit 429 contract including Retry-After. Clients need a deterministic recovery path.
Ship in shadow mode first, then enforce. The false-positive cost on real users is higher than the abuse cost during one week of observation.
Prefer per-user scopes over per-IP scopes for authenticated APIs. IP limiting breaks mobile carriers and corporate NAT.
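
One way to pin down the 429 contract is a small builder that every limiter path calls. The sketch below is illustrative only: `build_429`, the `X-Request-Id` header name, and the body shape are hypothetical, and the `RateLimit-Policy` value follows one revision of an IETF draft whose syntax has changed between versions, so treat that format as an assumption to check against the API's own documentation.

```python
import math
import uuid
from typing import Optional


def build_429(retry_after_s: float, limit: int, window_s: int,
              request_id: Optional[str] = None) -> dict:
    """Return a framework-neutral description of a 429 response."""
    headers = {
        # Whole seconds, rounded up so clients never retry too early.
        "Retry-After": str(max(1, math.ceil(retry_after_s))),
        # Assumed draft-style policy value: "<limit>;w=<window seconds>".
        "RateLimit-Policy": f"{limit};w={window_s}",
        # Hypothetical correlation header for support debugging.
        "X-Request-Id": request_id or str(uuid.uuid4()),
    }
    body = {"error": "rate_limited", "retry_after_seconds": int(headers["Retry-After"])}
    return {"status": 429, "headers": headers, "body": body}


sample = build_429(retry_after_s=2.3, limit=1000, window_s=60, request_id="req-123")
```

The returned dictionary can be pasted into the design doc as the sample response and adapted to whatever web framework the service uses.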

Rules: Don't

Don't choose fixed window for tight limits (under 100 req/min). The boundary-burst failure lets a caller double the limit by straddling windows.
Don't build counters in the application process for fairness-critical limits. Autoscaling changes the replica count and the effective limit silently shifts.
Don't alert on high 429 rate without splitting by scope. A spike from one abusive key looks identical to a broad regression.
Don't omit Retry-After. Clients without it will retry in tight loops and amplify the problem the limiter was meant to contain.
Don't limit logged-out and logged-in traffic with the same policy. Signal quality is very different; the scopes deserve distinct limits.
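
The boundary-burst failure named in the first rule is easy to reproduce. The sketch below simulates a fixed window and a sliding window counter side by side, assuming a 100 req/min limit and a caller that sends 100 requests one second before the boundary and 100 more one second after it. Class names and the weighting formula are a common textbook formulation, not something prescribed by this skill.

```python
class FixedWindow:
    """Counts requests per aligned window; the count resets at each boundary."""

    def __init__(self, limit, window_s):
        self.limit = limit
        self.window_s = window_s
        self.counts = {}  # window index -> requests admitted

    def allow(self, now):
        idx = int(now // self.window_s)
        if self.counts.get(idx, 0) >= self.limit:
            return False
        self.counts[idx] = self.counts.get(idx, 0) + 1
        return True


class SlidingWindowCounter(FixedWindow):
    """Weights the previous window by how much of it still overlaps the last window_s."""

    def allow(self, now):
        idx = int(now // self.window_s)
        elapsed_fraction = (now % self.window_s) / self.window_s
        previous = self.counts.get(idx - 1, 0)
        current = self.counts.get(idx, 0)
        estimated = previous * (1.0 - elapsed_fraction) + current
        if estimated >= self.limit:
            return False
        self.counts[idx] = current + 1
        return True


def admitted(limiter, times):
    return sum(1 for t in times if limiter.allow(t))


# 100 requests at t=59s, then 100 more at t=61s, straddling the 60s boundary.
burst = [59.0] * 100 + [61.0] * 100
fixed_total = admitted(FixedWindow(100, 60), burst)
sliding_total = admitted(SlidingWindowCounter(100, 60), burst)
```

In this simulation the fixed window admits all 200 requests within two seconds, while the sliding window counter admits 102, because the previous window's count still weighs on the estimate.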

Expected Behavior

After applying the skill, the team has named exactly one algorithm, one counter store, one scope grid, and one 429 contract. Shadow-mode logs run for at least a week before enforcement. Dashboards show block-rate per scope, and there is a documented rollback if paying customers start seeing 429s unexpectedly.

Debates about "which algorithm is best" stop; the answer is now a function of the burst profile and the fairness requirement, both written down.

Quality Gates

Algorithm choice cites the burst profile (P99/P50 ratio) or downstream constraint that motivated it.
Counter store is named (Redis cluster, DynamoDB, in-process) with the drift cost explicitly accepted.
Scope grid listed as a table — user, IP, key, endpoint combinations — not implied.
429 response contract includes Retry-After and a rate-limit policy header; sample response pasted in the design doc.
Shadow-mode window agreed (one week minimum for weekly seasonality).
Block-rate dashboard exists per scope, not just globally.
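
The scope-grid gate is easier to check when the grid exists as data rather than prose. A minimal sketch, assuming hypothetical rule names, limits, and a `rl:` key prefix; the point is only that each row declares which request attributes form the counter key, so a new scope becomes a new row instead of an ad hoc key format.

```python
from typing import NamedTuple


class ScopeRule(NamedTuple):
    name: str
    dimensions: tuple  # request attributes that make up the counter key
    limit: int
    window_s: int


# Illustrative rows only; the limits are made up for the example.
SCOPE_GRID = [
    ScopeRule("billing", ("user", "endpoint"), 1000, 60),
    ScopeRule("anti-ddos", ("ip",), 5000, 60),
    ScopeRule("b2b-plan", ("api_key", "endpoint"), 600, 60),
]


def counter_key(rule: ScopeRule, request: dict) -> str:
    """Build the counter-store key for one rule from the request attributes."""
    parts = [f"{dim}={request[dim]}" for dim in rule.dimensions]
    return ":".join(["rl", rule.name] + parts)


request = {"user": "u42", "api_key": "k-7", "ip": "203.0.113.7", "endpoint": "/v1/orders"}
keys = [counter_key(rule, request) for rule in SCOPE_GRID]
```

Each request is then checked against every key it maps to, and the block-rate dashboard can group by `rule.name` to get the per-scope split.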

Companion Integration

Pairs with sysdesign-monitoring-4-golden-signals (observability on the limiter itself), sysdesign-fault-tolerance-patterns (circuit breakers downstream of a limiter), and sysdesign-load-balancers (where L7 LBs can host the limiter natively). The matilha-harness-pack:harness-nfrs-as-prompts companion is the agent-side mirror — encoding rate-limit NFRs into an agent system prompt instead of an API enforcement layer.

Output Artifacts

A design-doc section titled "Rate limiting" naming algorithm, store, scope grid, and 429 contract.
Optional: a rate-limits.yaml file listing (scope, limit, window, algorithm) rows checked into the repo beside OpenAPI specs.
Dashboard link or panel JSON for block-rate per scope.
Shadow-mode log sample pasted into the design doc before enforcement.
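
A shadow-mode log sample can come from a thin wrapper around whatever limiter is chosen. The sketch below is one possible shape under stated assumptions: `ShadowLimiter`, the `decide` callback, the `tier` field, and the JSON record layout are all hypothetical names introduced for illustration.

```python
import json
from typing import Callable


class ShadowLimiter:
    """Wraps a limiter decision; in shadow mode it logs but never blocks."""

    def __init__(self, decide: Callable[[str], bool], enforce: bool = False,
                 sink: Callable[[str], None] = print):
        self.decide = decide    # returns True when the request is within limits
        self.enforce = enforce  # False = shadow mode
        self.sink = sink        # where log lines go (stdout by default)

    def allow(self, key: str, tier: str) -> bool:
        within_limit = self.decide(key)
        if not within_limit:
            event = "blocked" if self.enforce else "would_block"
            self.sink(json.dumps({"event": event, "key": key, "tier": tier}))
        # Shadow mode always lets the request through.
        return within_limit or not self.enforce


records = []
shadow = ShadowLimiter(decide=lambda key: key != "key-abuser", sink=records.append)
passed = shadow.allow("key-abuser", tier="paid")  # logged, but still served
paid_would_blocks = sum(1 for r in records if json.loads(r)["tier"] == "paid")
```

Counting `would_block` records by tier gives the number the workflow asks for before enforcement is switched on.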

Example Constraint Language

Use "must" for: defining the 429 contract before enforcement, shadow-mode observation window of at least one week, storing fairness-critical counters in a shared store.
Use "should" for: token bucket as default for human-facing APIs, per-user over per-IP for authenticated traffic, dashboard per scope.
Use "may" for: stateless sidecar counters on best-effort DDoS shields, fixed window on coarse limits above a few hundred RPM, application-level fallback queueing instead of hard 429.

Troubleshooting

"Legit users keep hitting 429 after a UI retry": the algorithm is probably leaky bucket or fixed window. Switch to token bucket sized to absorb a normal retry burst (3-5 requests in two seconds).
"Limits work on one replica, fail on another": counters are in-process and replicas drifted after autoscale. Move to a shared store.
"One abusive key is tanking the global error rate": dashboards are not split by scope. Add per-scope block-rate panels and a separate alert path for single-key anomalies.
"Shadow mode shows we would block 8% of paid customers": the limits are mis-tuned, not the users. Pull P99 per paid user, set the limit at the 99.5 percentile, re-run shadow.
"Mobile carrier NAT is getting blocked by per-IP limit": switch the scope to per-user for authenticated traffic and keep per-IP only for unauthenticated endpoints.

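For the mis-tuned-limits case, the percentile step can be made explicit. The sketch below assumes one reading of that advice: collect each paid user's peak request rate from shadow-mode data, then take the 99.5th percentile of those values as the candidate limit. The function name and the nearest-rank method are choices made for the example, and the input data is synthetic.

```python
import math


def nearest_rank_percentile(values, pct):
    """Smallest value such that at least pct percent of values are <= it."""
    if not values:
        raise ValueError("need at least one observation")
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


# Synthetic per-user peak rates (req/min) standing in for real shadow-mode data.
peak_per_paid_user = list(range(1, 1001))
candidate_limit = nearest_rank_percentile(peak_per_paid_user, 99.5)
still_blocked = sum(1 for peak in peak_per_paid_user if peak > candidate_limit)
```

The users counted in `still_blocked` are the candidates for a scoped override rather than a reason to raise the limit for everyone.
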
Concrete Example

A SaaS team launches a public API with a 1000 req/min per-key limit using fixed window in Redis. Boundary bursts let a client spike 1900 requests in two seconds across the window boundary, tripping the downstream DB. Switching to token bucket with bucket size 200 and refill 1000/min keeps the long-run average identical while smoothing the boundary. Shadow mode shows three paying keys hit the new limit, so the team adds a scoped override for those tiers before enforcing. Post-launch, 429 rate stays at 0.3% globally and the DB stops seeing the boundary spikes.
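
The numbers in this example can be sanity-checked with two worst-case bounds. This is a rough sketch with hypothetical function names: it assumes the token bucket starts full and that a fixed-window caller times requests to straddle a boundary.

```python
import math


def fixed_window_worst_case(limit, window_s, interval_s):
    """Upper bound on requests admitted in any interval of interval_s seconds."""
    windows_touched = math.ceil(interval_s / window_s) + 1
    return limit * windows_touched


def token_bucket_worst_case(capacity, refill_per_min, interval_s):
    """A full bucket plus whatever refills during the interval."""
    return math.floor(capacity + refill_per_min / 60 * interval_s)


# The example's parameters: 1000 req/min, bucket size 200, a two-second burst.
fixed_bound = fixed_window_worst_case(limit=1000, window_s=60, interval_s=2)
bucket_bound = token_bucket_worst_case(capacity=200, refill_per_min=1000, interval_s=2)
```

Under these assumptions the fixed window can admit up to 2000 requests in two seconds, which is consistent with the 1900-request spike described, while the token bucket caps the same two seconds at 233.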

Sources

[[concepts/nfr-system-design]] — rate limiting as an availability and fault-tolerance lever
[[concepts/design-cases]] — Design Rate Limiting (Chapter 8) case study
Zhiyong Tan, Acing the System Design Interview, Chapter 8. The "when in doubt, don't limit the user" bias is paraphrased from Tan's discussion of false-positive cost.
