Design a Rate Limiter
Rate limiters are everywhere. Stripe uses them to stop a buggy client script from accidentally spamming thousands of payment requests. GitHub uses them so one CI pipeline can't choke the entire API for every other developer. Cloudflare uses them as the first line of defence against DDoS attacks.
And yet, when this question shows up in a system design interview, most engineers freeze. The algorithms sound similar, the edge cases are subtle, and "just use Redis" only gets you so far before the interviewer asks how you handle a race condition across 50 application servers.
This guide walks you through the full design — from clarifying requirements and picking the right algorithm, to building a distributed, Redis-backed rate limiter that can handle millions of requests per second. By the end, you'll know not just what to say, but why — which is what actually gets you the offer.
Problem Statement
Design a rate limiter that can:
- Limit the number of requests a client can make to an API within a given time window
- Be configurable per user, per API key, per IP address, or per endpoint
- Work correctly across multiple application servers (distributed environment)
- Return the right response headers and error codes to the client
- Have minimal impact on the latency of allowed requests
The scope here is a server-side, distributed rate limiter sitting in the request path — the kind you'd build as a shared middleware component or deploy at the API Gateway layer.
Requirements Gathering
Start every system design interview by clarifying requirements. This signals maturity and prevents you from designing the wrong thing.
Functional Requirements
- Throttle requests: Reject requests that exceed a configured threshold within a time window
- Multiple limiting dimensions: Support rate limiting by user ID, API key, IP address, and endpoint
- Multiple granularities: Support limits like "1000 requests per minute" or "10,000 requests per hour"
- Response signalling: Return HTTP 429 Too Many Requests with Retry-After headers for rejected requests
- Rule configuration: Allow rules to be configured without re-deploying the service
- Per-tier limits: Different limits for free vs. premium users
Non-Functional Requirements
- Low latency: The rate limiting check should add no more than 1–5ms to each request
- High availability: The rate limiter should not become a single point of failure; if it goes down, the system should either fail-open or fail-closed based on policy
- Accuracy: Limits should be enforced correctly even under high concurrency across many servers
- Scalability: Handle tens of thousands of requests per second without becoming a bottleneck
- Durability: Rate limit counters don't need to survive server restarts — eventual accuracy is acceptable
Back-of-the-Envelope Calculations
Let's ground the design in numbers before diving into architecture.
Assume:
- 10M active users
- Each user averages 10 API calls/minute
- Peak multiplier: 5x
Average request rate: 10M users × 10 req/min = 100M req/min ≈ 1.67M req/sec
Peak request rate: 1.67M × 5 ≈ 8.3M req/sec
Rate limiter overhead:
- One Redis lookup per request
- Redis can handle 100K–1M ops/sec per node
- At 8M req/sec we need ~8–80 Redis nodes (depending on command complexity)
Storage per user (sliding window counter):
- 2 counters × 8 bytes + key overhead ≈ ~100 bytes per user per window
- 10M users × 100 bytes = ~1GB of Redis memory — very manageable
This tells us Redis is the right storage backend, and we'll need a small cluster. Memory is not a concern; throughput is the dimension to design for.
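The arithmetic above can be sanity-checked in a few lines (the per-node throughput figure is the optimistic end of the range quoted above):

```python
users = 10_000_000
calls_per_min = 10
peak_multiplier = 5

avg_rps = users * calls_per_min / 60          # ~1.67M req/sec
peak_rps = avg_rps * peak_multiplier          # ~8.3M req/sec

redis_ops_per_node = 1_000_000                # optimistic per-node throughput
nodes_needed = peak_rps / redis_ops_per_node  # ~8 nodes in the best case

bytes_per_user = 100
total_memory_gb = users * bytes_per_user / 1e9  # ~1.0 GB
```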
Where Does the Rate Limiter Live?
Before choosing an algorithm, decide where in the stack the rate limiter sits. This comes up in interviews and the trade-offs are real.
Option 1: Client-Side
A client SDK enforces its own rate limits before even sending a request.
- Pro: Zero network overhead, zero server load from throttled requests
- Con: Clients can be modified or bypassed. Never rely on client-side rate limiting alone.
Option 2: API Gateway (Recommended for Most Cases)
A centralised gateway — like AWS API Gateway, Kong, or Nginx — intercepts every request before it reaches your application servers.
Client → [API Gateway with Rate Limiter] → Application Servers → Database
- Pro: Single enforcement point, language-agnostic, can be configured without touching application code
- Con: The gateway becomes a potential bottleneck and single point of failure; needs its own HA story
Option 3: Application Server Middleware
Rate limiting logic lives as middleware inside each application server, backed by a shared Redis instance.
Client → Load Balancer → [App Server + Rate Limiter Middleware] → Database
↕
Redis Cluster
- Pro: Flexible, no extra network hop to a separate gateway
- Con: Every app server needs the Redis client; logic is harder to change centrally
In practice: Most large companies use the API Gateway approach for external APIs and middleware for internal service-to-service rate limiting. In your interview, recommend the API Gateway approach and explain both.
The Five Rate Limiting Algorithms
This is the meat of the interview. Know all five, their trade-offs, and when to reach for each one.
1. Fixed Window Counter
How it works: Divide time into fixed windows (e.g., one-minute buckets). Keep a counter per user per window. Reject requests once the counter exceeds the limit.
Window: [12:00:00 — 12:00:59]
User A counter: 0 → 1 → 2 → ... → 100 (limit reached, reject)
Window resets at: 12:01:00
User A counter: 0 (fresh start)

Redis implementation:
import time
from redis import Redis

redis = Redis()  # shared client used by the snippets in this guide

def is_allowed_fixed_window(user_id: str, limit: int, window_seconds: int) -> bool:
    now = int(time.time())
    window_key = f"ratelimit:{user_id}:{now // window_seconds}"
    count = redis.incr(window_key)
    if count == 1:
        # Set expiry only on first increment; 2x window leaves headroom for drift
        redis.expire(window_key, window_seconds * 2)
    return count <= limit

Pros:
- Dead simple to implement and understand
- Memory efficient: one counter per user per window
- O(1) time complexity
Cons:
- The boundary problem. A user can send 100 requests at 12:00:59 and another 100 at 12:01:01 — that's 200 requests in 2 seconds while technically respecting the "100 per minute" rule. This is the classic fixed window exploit.
When to use: Simple internal quotas where boundary gaming isn't a concern. Not suitable for external-facing APIs.
2. Sliding Window Log
How it works: Instead of a counter, store a sorted set of timestamps for every request. To check a new request, count how many timestamps fall within the last N seconds. If over the limit, reject.
def is_allowed_sliding_log(user_id: str, limit: int, window_seconds: int) -> bool:
    now = time.time()
    window_start = now - window_seconds
    key = f"ratelimit:log:{user_id}"
    # Atomic pipeline: trim, count, add, refresh TTL in one round trip
    pipe = redis.pipeline()
    pipe.zremrangebyscore(key, 0, window_start)  # Remove entries older than the window
    pipe.zcard(key)                              # Count what remains
    pipe.zadd(key, {str(now): now})              # Record the current request
    pipe.expire(key, window_seconds)
    results = pipe.execute()
    count = results[1]  # Count taken before the current request was added
    # Note: a rejected request still leaves its timestamp in the set, so it
    # counts against the user's window. Remove it afterwards if you don't want that.
    return count < limit

Pros:
- Perfectly accurate — no boundary gaming possible
- The sliding window is truly continuous
Cons:
- Memory hungry. Each request timestamp is stored. At 1000 requests/minute per user, that's 1000 entries in Redis per user per window.
- Not practical at extreme scale
When to use: Low-volume APIs where accuracy is paramount (e.g., financial transaction APIs). Not suitable for high-throughput systems.
3. Sliding Window Counter (Hybrid — Recommended)
How it works: This is the sweet spot. Instead of storing individual timestamps, keep counters for the current and previous windows, then use a weighted calculation to approximate the sliding window.
If you're 75% through the current window:
Estimated count = (previous_window_count × 0.25) + current_window_count

def is_allowed_sliding_counter(user_id: str, limit: int, window_seconds: int) -> bool:
    now = time.time()
    current_window = int(now // window_seconds)
    prev_window = current_window - 1
    elapsed_in_window = now % window_seconds
    weight = elapsed_in_window / window_seconds  # How far into the current window we are
    curr_key = f"ratelimit:{user_id}:{current_window}"
    prev_key = f"ratelimit:{user_id}:{prev_window}"
    pipe = redis.pipeline()
    pipe.get(prev_key)
    pipe.get(curr_key)
    results = pipe.execute()
    prev_count = int(results[0] or 0)
    curr_count = int(results[1] or 0)
    estimated = (prev_count * (1 - weight)) + curr_count
    if estimated >= limit:
        return False
    # Note: this check-then-increment is not atomic across servers; in
    # production, move the whole check into a Lua script so Redis executes
    # it as a single operation (see the race-condition section below).
    redis.incr(curr_key)
    redis.expire(curr_key, window_seconds * 2)
    return True

Pros:
- Accurate approximation (error rate is typically less than 1% for smooth traffic)
- Memory efficient: only two counters per user
- Eliminates the boundary spike problem
Cons:
- Slightly complex to reason about
- An approximation, not a guarantee (acceptable for most use cases)
When to use: The best general-purpose algorithm for external APIs. This is what Cloudflare uses at scale.
4. Token Bucket
How it works: Each user has a "bucket" with a maximum capacity of N tokens. Tokens are added at a fixed refill rate. Each request consumes one token. If the bucket is empty, the request is rejected.
Bucket capacity: 100 tokens
Refill rate: 10 tokens/second
t=0: Bucket = 100 tokens
t=0: 50 requests arrive → Bucket = 50 tokens (burst handled)
t=5s: 50 tokens refilled → Bucket = 100 tokens
t=5s: 120 requests arrive → 100 allowed, 20 rejected

Redis implementation using an atomic Lua script (critical for correctness):
-- token_bucket.lua
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2]) -- tokens per second
local now = tonumber(ARGV[3])

local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now

-- Calculate tokens to add based on elapsed time
local elapsed = now - last_refill
local new_tokens = math.min(capacity, tokens + (elapsed * refill_rate))

if new_tokens >= 1 then
  redis.call('HMSET', key, 'tokens', new_tokens - 1, 'last_refill', now)
  redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) * 2)
  return 1 -- allowed
else
  return 0 -- rejected
end

Pros:
- Naturally handles burst traffic — users can "save up" unused capacity
- Smooth enforcement of long-term average rate
- Two intuitive parameters: capacity (burst tolerance) and refill rate (sustained throughput)
Cons:
- Two parameters make it slightly harder to reason about
- Doesn't guarantee a perfectly smooth request rate
When to use: APIs that need to tolerate bursts. Stripe, for example, allows short bursts well above the sustained rate for legitimate payment processing peaks.
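The refill arithmetic from the Lua script can also be sketched in plain Python, with an injected clock so the behaviour is easy to follow. This in-memory version is for illustration only; in a distributed deployment you need the Lua version above, because only Redis can make the check atomic across servers:

```python
class TokenBucket:
    """In-memory token bucket mirroring the Lua script's refill math."""

    def __init__(self, capacity: float, refill_rate: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.tokens = capacity          # start full, like the Lua default
        self.last_refill = now

    def allow(self, now: float) -> bool:
        # Refill based on elapsed time, capped at capacity
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Replays the worked example: capacity 100, refill 10 tokens/second
bucket = TokenBucket(capacity=100, refill_rate=10)
burst = [bucket.allow(now=0.0) for _ in range(120)]
print(sum(burst))             # 100 allowed, 20 rejected at t=0
print(bucket.allow(now=5.0))  # True: 50 tokens refilled after 5 seconds
```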
5. Leaky Bucket
How it works: Requests go into a FIFO queue (the "bucket"). The queue drains at a fixed constant rate. If the bucket is full when a new request arrives, that request is dropped immediately.
Queue capacity: 100 requests
Drain rate: 10 requests/second
Requests enter at any rate → processed at exactly 10/sec → excess dropped

Pros:
- Output rate is perfectly constant and predictable
- Great for protecting downstream services from traffic spikes
Cons:
- Requests can wait in the queue, adding latency — problematic for real-time APIs
- Doesn't allow any bursting; a sudden spike of legitimate traffic gets dropped
When to use: Rate-limiting outbound calls to a third-party service that can't handle bursts (e.g., an SMS gateway with strict throughput limits). Less common for inbound API rate limiting.
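A minimal in-process sketch of the idea: a bounded queue drained at a fixed rate, with a manual clock so it can be stepped deterministically. The class and method names are illustrative, not a production design:

```python
from collections import deque

class LeakyBucket:
    def __init__(self, capacity: int, drain_rate: float):
        self.capacity = capacity      # max queued requests
        self.drain_rate = drain_rate  # requests processed per second
        self.queue = deque()
        self.last_drain = 0.0

    def offer(self, request, now: float) -> bool:
        """Try to enqueue a request; returns False if the bucket is full."""
        self._drain(now)
        if len(self.queue) >= self.capacity:
            return False  # bucket full: drop immediately
        self.queue.append(request)
        return True

    def _drain(self, now: float):
        # Process as many queued requests as the drain rate allows
        to_process = int((now - self.last_drain) * self.drain_rate)
        if to_process > 0:
            for _ in range(min(to_process, len(self.queue))):
                self.queue.popleft()  # in a real system: hand off to a worker
            self.last_drain = now

bucket = LeakyBucket(capacity=100, drain_rate=10)
accepted = sum(bucket.offer(i, now=0.0) for i in range(150))
print(accepted)             # 100 accepted, 50 dropped at t=0
bucket.offer("x", now=3.0)  # 30 requests drained by t=3s, so this one fits
print(len(bucket.queue))    # 71
```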
Algorithm Comparison at a Glance
Algorithm | Burst Handling | Memory | Accuracy | Complexity
-----------------------|----------------|----------|----------|------------
Fixed Window Counter | Yes (at edges) | Low | Low | Very Low
Sliding Window Log | No | High | Perfect | Medium
Sliding Window Counter | Partial | Low | High | Medium
Token Bucket | Yes | Low | High | Medium
Leaky Bucket           | No             | Medium   | Perfect  | Medium

Interview tip: When asked which algorithm to choose, say "It depends on the use case." Then demonstrate you know the trade-offs. For a general external-facing API, the sliding window counter or token bucket are usually the right answer. Name them and explain why.
High-Level Architecture
Here's a distributed rate limiter that can handle millions of requests per second:
┌──────────────────────┐
│ Rate Limit Config │
│ (DB / Config Store) │
└──────────┬───────────┘
│ config reload
┌──────────────────────┼──────────────────────┐
│ │ │
┌─────────▼──────┐ ┌──────────▼─────┐ ┌──────────▼─────┐
│ App Server 1 │ │ App Server 2 │ │ App Server 3 │
│ + RL Middleware│ │ + RL Middleware│ │ + RL Middleware│
└─────────┬──────┘ └──────────┬─────┘ └──────────┬─────┘
│ │ │
└──────────────────────┼──────────────────────┘
│ atomic Redis ops
┌──────────▼───────────┐
│ Redis Cluster │
│ (Primary + Replicas) │
└──────────────────────┘

Key Components
Rate Limit Middleware
- Runs in the request path on every app server
- Extracts the limiting key (user ID, API key, or IP) from the request
- Calls Redis atomically to check and update the counter
- Returns HTTP 429 with appropriate headers if the limit is exceeded
- Adds rate limit headers to all responses (allowed and rejected)
Redis Cluster
- Stores all rate limiting state (counters, token buckets, timestamps)
- Chosen for its sub-millisecond latency and native atomic operations (INCR, EXPIRE, Lua scripts)
- Cluster mode distributes keys across multiple nodes for horizontal scaling
- Replication ensures the cluster survives node failures
Rate Limit Config Store
- Stores rules like: {user_tier: "free", endpoint: "/api/search", limit: 100, window: 60}
- Rules are loaded into application memory at startup and refreshed periodically
- Keeps rule changes out of the hot path
Handling Distributed Race Conditions
This is where most interview answers fall apart. When you have 50 app servers all reading and writing to Redis concurrently, naive implementations break.
The Problem
Server 1: GET counter for user A → 99
Server 2: GET counter for user A → 99
Server 1: counter < 100, allow request. SET counter = 100
Server 2: counter < 100, allow request. SET counter = 100
Result: 2 requests allowed when limit was 100. User actually sent 101 total.

The Solutions
Solution 1: Redis INCR + EXPIRE (Fixed Window)
INCR is atomic — it reads and increments in a single operation. No two clients can read the same value.
count = redis.incr(key)  # atomic: read + increment + write
if count == 1:
    redis.expire(key, window_seconds)
return count <= limit

Solution 2: Lua Scripts (Token Bucket / Sliding Window Counter)
For multi-step operations (read → compute → write), use a Lua script. Redis executes Lua scripts atomically — the entire script runs as a single transaction.
-- All reads and writes in one atomic block
local tokens = redis.call('GET', KEYS[1])
-- ... compute ...
redis.call('SET', KEYS[1], new_value)
-- No other command can interleave here

Solution 3: Redis Transactions (WATCH/MULTI/EXEC)
Redis optimistic locking — if the key changes between WATCH and EXEC, the transaction aborts and you retry. Works, but retries under high contention can add latency. Lua scripts are generally preferred.
What to say in an interview: Lead with Lua scripts. They're the cleanest, most production-proven approach. Mention Redis's single-threaded command execution as the reason they work.
Response Headers — Don't Forget These
A well-designed rate limiter communicates clearly with clients. Include these headers on every response:
# For allowed requests
X-RateLimit-Limit: 1000 # The configured limit
X-RateLimit-Remaining: 742 # How many requests remain in this window
X-RateLimit-Reset: 1696118400 # Unix timestamp when the window resets
# For throttled requests (HTTP 429)
HTTP/1.1 429 Too Many Requests
Retry-After: 47 # Seconds until the client can retry
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1696118400

These headers matter because they let API clients implement exponential backoff correctly instead of hammering your service.
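A small helper that assembles these headers in the middleware might look like this (the function names and dict-based shape are illustrative; the values mirror the example above):

```python
def rate_limit_headers(limit: int, remaining: int, reset_epoch: int) -> dict:
    """Build the standard X-RateLimit-* headers for any response."""
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_epoch),
    }

def throttled_headers(limit: int, reset_epoch: int, now_epoch: int) -> dict:
    """Headers for a 429 response, including Retry-After in seconds."""
    headers = rate_limit_headers(limit, 0, reset_epoch)
    headers["Retry-After"] = str(max(0, reset_epoch - now_epoch))
    return headers

print(throttled_headers(limit=1000, reset_epoch=1696118400, now_epoch=1696118353))
# {'X-RateLimit-Limit': '1000', 'X-RateLimit-Remaining': '0',
#  'X-RateLimit-Reset': '1696118400', 'Retry-After': '47'}
```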
Rate Limiting Rules & Policies
In a real system, rate limits aren't one-size-fits-all. You need a rules engine.
Example Rule Schema
{
"rules": [
{
"id": "free-tier-global",
"match": {"user_tier": "free"},
"limit": 1000,
"window_seconds": 3600,
"algorithm": "sliding_window_counter"
},
{
"id": "premium-search-endpoint",
"match": {"user_tier": "premium", "endpoint": "/api/search"},
"limit": 500,
"window_seconds": 60,
"algorithm": "token_bucket",
"bucket_capacity": 1000,
"refill_rate": 8.33
},
{
"id": "unauthenticated-ip",
"match": {"auth": false},
"limit": 20,
"window_seconds": 60,
"algorithm": "fixed_window"
}
]
}

Rule Evaluation Order
Rules are evaluated from most specific to least specific:
- User ID + Endpoint (most specific)
- User ID only
- API Key
- IP Address
- Global fallback (least specific)
The first matching rule wins.
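Most-specific-first evaluation reduces to an ordered scan: keep the rules sorted by specificity and return the first one whose match conditions all hold. The rule shape here follows the example schema above; the specific IDs and limits are made up:

```python
# Rules sorted most specific to least specific; the first match wins.
RULES = [
    {"match": {"user_id": "u42", "endpoint": "/api/search"}, "limit": 50},  # user + endpoint
    {"match": {"user_id": "u42"}, "limit": 200},                            # user only
    {"match": {"api_key": "key-abc"}, "limit": 500},                        # API key
    {"match": {"ip": "203.0.113.9"}, "limit": 20},                          # IP address
    {"match": {}, "limit": 1000},                                           # global fallback
]

def find_rule(request: dict) -> dict:
    """Return the first rule whose match conditions are all satisfied."""
    for rule in RULES:
        if all(request.get(k) == v for k, v in rule["match"].items()):
            return rule  # an empty match dict matches everything
    raise RuntimeError("no fallback rule configured")

print(find_rule({"user_id": "u42", "endpoint": "/api/search"})["limit"])  # 50
print(find_rule({"user_id": "u42", "endpoint": "/api/other"})["limit"])   # 200
print(find_rule({"ip": "198.51.100.1"})["limit"])                         # 1000
```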
Failing Gracefully
Your rate limiter will eventually have a Redis outage. Decide your failure policy upfront.
Fail-Open
If Redis is unavailable, allow all requests through.
try:
    allowed = check_rate_limit(user_id)
except RedisConnectionError:
    allowed = True  # Fail open — don't block legitimate traffic
    log.error("Rate limiter unavailable, failing open")

- Pro: Service stays available; no false rejections
- Con: Abuse can slip through during the outage window
Fail-Closed
If Redis is unavailable, reject all requests.
except RedisConnectionError:
    return False  # Fail closed — reject everything

- Pro: Consistent enforcement; protects downstream systems
- Con: You're taking your own service down during a Redis failure
In practice: Most external-facing APIs fail-open and rely on circuit breakers and monitoring to catch the outage quickly. Financial or security-sensitive endpoints sometimes fail-closed.
Scaling the Rate Limiter
Redis Cluster Sharding
Rate limit keys are naturally distributed by user ID. Redis Cluster hashes keys to one of 16,384 hash slots, spread across nodes. This gives you horizontal scaling with no hotspots (unless you have a single "super user" sending millions of requests — handle that at the application layer with a per-user dedicated key strategy).
Local Cache Layer (For Extreme Scale)
At very high throughput (millions of req/sec), even Redis can be a bottleneck. Add a local in-memory counter on each app server that syncs to Redis every 100ms.
Trade-off: A user could burst to N × limit for up to 100ms
(where N = number of app servers) before global enforcement kicks in.

This is the approach used by systems handling truly massive scale. Mention it as an optimization, along with the accuracy trade-off.
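A sketch of the local-counter idea, with the Redis sync stubbed out by a plain dict. The class name, the flush callback, and the 100ms interval are illustrative; real code would replace the stub with pipelined INCRBY calls:

```python
import threading
from collections import defaultdict

class LocalCounter:
    """Per-server in-memory counters, periodically flushed to shared storage."""

    def __init__(self, flush_fn, interval: float = 0.1):
        self.pending = defaultdict(int)  # key -> increments since last flush
        self.lock = threading.Lock()
        self.flush_fn = flush_fn         # e.g. pipelined Redis INCRBY calls
        self.interval = interval         # a background thread flushes this often

    def incr(self, key: str):
        with self.lock:
            self.pending[key] += 1       # no network hop on the hot path

    def flush(self):
        with self.lock:
            batch, self.pending = dict(self.pending), defaultdict(int)
        if batch:
            self.flush_fn(batch)         # one round trip for many increments

# Stand-in for Redis: a plain dict acting as the global store
global_store = defaultdict(int)

def flush_to_store(batch):
    for key, count in batch.items():
        global_store[key] += count       # real code: redis.incrby(key, count)

counter = LocalCounter(flush_to_store)
for _ in range(250):
    counter.incr("ratelimit:user42")
counter.flush()  # in production, a background thread calls this every 100ms
print(global_store["ratelimit:user42"])  # 250
```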
Read Replicas
For the rate limit config store (not the counters), use Redis replicas to spread read load across your app servers without hitting the primary for every rule lookup.
Monitoring and Alerting
Key Metrics to Track
Rate Limiter Health
- Redis command latency (p50, p95, p99) — alert if p99 exceeds 5ms
- Redis memory utilization — alert above 80%
- Rate limiter middleware error rate — alert above 0.01%
Traffic Patterns
- Requests throttled per minute (absolute and percentage of total)
- Top throttled users / IPs (to identify abuse or misconfiguration)
- 429 response rate per endpoint
Business Metrics
- False positive rate (legitimate traffic being throttled) — requires user feedback loops
- Rate limit breach patterns (detect coordinated attacks)
Common Interview Follow-ups
"How would you rate limit GraphQL where a single query can be cheap or very expensive?"
GraphQL makes request-count-based limits almost useless — one query can fetch thousands of records while another fetches one field. The solution is cost-based rate limiting, where each query is assigned a complexity score based on the fields and depth requested, and the rate limit is enforced against cumulative complexity rather than raw request count. GitHub's GraphQL API does exactly this.
"How do you prevent a user from bypassing IP-based rate limiting?"
Sophisticated attackers rotate IPs or use proxy networks. Layered defences work better: combine IP-based limits with user-account-based limits, API key limits, and device fingerprinting. For the most abusive actors, use CAPTCHA challenges or require account verification.
"What if different regions have very different traffic patterns?"
Consider geo-distributed rate limiting where each region enforces limits independently against a regional Redis cluster, with periodic synchronisation to a global counter. This reduces cross-region latency at the cost of brief windows where a user could exceed their global limit by spreading requests across regions. Mention this as a trade-off and let the interviewer decide which direction to explore.
"How do you handle clock skew in a distributed system?"
Token bucket and sliding window calculations depend on accurate timestamps. If servers have clock skew, refill calculations become incorrect. Use the Redis TIME command to get the Redis server's timestamp (a single source of truth) rather than the application server's local clock. This eliminates skew issues entirely.
"What's the difference between rate limiting and throttling?"
Rate limiting is hard enforcement — requests over the limit are rejected immediately. Throttling is softer — requests are queued and delayed rather than dropped. Throttling is appropriate when brief delays are acceptable and you'd rather slow down a user than reject them outright. Leaky bucket is a throttling approach; token bucket and sliding window are rate limiting approaches.
Quick Interview Checklist
Before wrapping up your answer, make sure you've covered these:
- ✅ Clarified whether this is inbound API rate limiting, outbound call throttling, or both
- ✅ Chosen an algorithm and explained why (not just what)
- ✅ Addressed distributed correctness (race conditions, Lua scripts)
- ✅ Mentioned Redis as the storage backend and why (atomic ops, sub-millisecond latency)
- ✅ Discussed the failure mode (fail-open vs. fail-closed)
- ✅ Mentioned response headers (429, Retry-After, X-RateLimit-*)
- ✅ Covered how rules are configured and evaluated
- ✅ Talked about monitoring and what metrics matter
Conclusion
Rate limiters look deceptively simple until you start thinking about distributed systems. The real challenge isn't picking an algorithm — it's making sure the algorithm runs correctly when you have 50 app servers, a Redis cluster, and millions of requests hitting you simultaneously.
The key design decisions in a nutshell:
- Algorithm: Sliding window counter or token bucket for most external APIs. Fixed window if simplicity is paramount and boundary gaming isn't a concern.
- Storage: Redis, always. Its atomic operations (INCR, Lua scripts) are what make distributed rate limiting tractable.
- Race conditions: Use Lua scripts. They execute atomically and eliminate the read-compute-write race.
- Failure mode: Fail-open for most APIs; fail-closed for sensitive endpoints.
- Communication: Always return Retry-After and X-RateLimit-* headers so clients can behave well.
Rate limiters are one of those problems that reward engineers who understand distributed systems fundamentals. Get the basics right, then show the interviewer you know the edge cases — that combination is what separates a good answer from a great one.
Preparing for system design interviews? Mockingly.ai is built specifically for engineers who want to get serious about system design — whether it's backend, frontend, or mobile. Practice with realistic interview simulations and get the feedback loop you actually need before the real thing.
