How to Design a Rate Limiter - Complete System Design Guide

19 min read · October 1, 2025 · System Designer

backend
medium
rate-limiter
distributed-systems
system-design-interview
redis

Asked at

Amazon
Google
Stripe
Microsoft
Uber
Atlassian
LinkedIn
Datadog
Master rate limiter system design for your next big tech interview. Learn every algorithm — token bucket, sliding window, leaky bucket — plus Redis-backed distributed implementations, race condition fixes, and real-world trade-offs used by Stripe, GitHub, and Twitter.

Design a Rate Limiter

Rate limiters are everywhere. Stripe uses them to stop a buggy client script from accidentally spamming thousands of payment requests. GitHub uses them so one CI pipeline can't choke the entire API for every other developer. Cloudflare uses them as the first line of defence against DDoS attacks.

And yet, when this question shows up in a system design interview, most engineers freeze. The algorithms sound similar, the edge cases are subtle, and "just use Redis" only gets you so far before the interviewer asks how you handle a race condition across 50 application servers.

This guide walks you through the full design — from clarifying requirements and picking the right algorithm, to building a distributed, Redis-backed rate limiter that can handle millions of requests per second. By the end, you'll know not just what to say, but why — which is what actually gets you the offer.


Problem Statement

Design a rate limiter that can:

  • Limit the number of requests a client can make to an API within a given time window
  • Be configurable per user, per API key, per IP address, or per endpoint
  • Work correctly across multiple application servers (distributed environment)
  • Return the right response headers and error codes to the client
  • Have minimal impact on the latency of allowed requests

The scope here is a server-side, distributed rate limiter sitting in the request path — the kind you'd build as a shared middleware component or deploy at the API Gateway layer.


Requirements Gathering

Start every system design interview by clarifying requirements. This signals maturity and prevents you from designing the wrong thing.

Functional Requirements

  • Throttle requests: Reject requests that exceed a configured threshold within a time window
  • Multiple limiting dimensions: Support rate limiting by user ID, API key, IP address, and endpoint
  • Multiple granularities: Support limits like "1000 requests per minute" or "10,000 requests per hour"
  • Response signalling: Return HTTP 429 Too Many Requests with Retry-After headers for rejected requests
  • Rule configuration: Allow rules to be configured without re-deploying the service
  • Per-tier limits: Different limits for free vs. premium users

Non-Functional Requirements

  • Low latency: The rate limiting check should add no more than 1–5ms to each request
  • High availability: The rate limiter should not become a single point of failure; if it goes down, the system should either fail-open or fail-closed based on policy
  • Accuracy: Limits should be enforced correctly even under high concurrency across many servers
  • Scalability: Handle tens of thousands of requests per second without becoming a bottleneck
  • Durability: Rate limit counters don't need to survive server restarts — eventual accuracy is acceptable

Back-of-the-Envelope Calculations

Let's ground the design in numbers before diving into architecture.

plaintext
Assume:
  - 10M active users
  - Each user averages 10 API calls/minute
  - Peak multiplier: 5x
 
Average request rate:   10M users × 10 req/min = 100M req/min ≈ 1.67M req/sec
Peak request rate:      ~8M req/sec
 
Rate limiter overhead:
  - One Redis lookup per request
  - Redis can handle 100K–1M ops/sec per node
  - At 8M req/sec we need ~8–80 Redis nodes (depending on command complexity)
 
Storage per user (sliding window counter):
  - 2 counters × 8 bytes + key overhead ≈ ~100 bytes per user per window
  - 10M users × 100 bytes = ~1GB of Redis memory — very manageable

This tells us Redis is the right storage backend, and we'll need a small cluster. Memory is not a concern; throughput is the dimension to design for.
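These estimates are easy to sanity-check with a few lines of arithmetic. The script below simply mirrors the assumptions above (the per-node Redis throughput is the conservative end of the 100K–1M range):

```python
# Sanity-check the back-of-the-envelope capacity estimates.
users = 10_000_000
calls_per_min = 10
peak_multiplier = 5

avg_rps = users * calls_per_min / 60          # average requests/sec
peak_rps = avg_rps * peak_multiplier          # peak requests/sec

redis_ops_per_node = 100_000                  # conservative per-node throughput
nodes_needed = peak_rps / redis_ops_per_node  # Redis nodes needed at peak

bytes_per_user = 100                          # two counters + key overhead
total_memory_gb = users * bytes_per_user / 1e9

print(f"avg ≈ {avg_rps / 1e6:.2f}M req/s, peak ≈ {peak_rps / 1e6:.1f}M req/s")
print(f"~{nodes_needed:.0f} Redis nodes (conservative), ~{total_memory_gb:.0f} GB memory")
```

Running this confirms the shape of the problem: throughput drives the node count, while memory stays trivially small.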


Where Does the Rate Limiter Live?

Before choosing an algorithm, decide where in the stack the rate limiter sits. This comes up in interviews and the trade-offs are real.

Option 1: Client-Side

A client SDK enforces its own rate limits before even sending a request.

  • Pro: Zero network overhead, zero server load from throttled requests
  • Con: Clients can be modified or bypassed. Never rely on client-side rate limiting alone.

Option 2: API Gateway

A centralised gateway — like AWS API Gateway, Kong, or Nginx — intercepts every request before it reaches your application servers.

plaintext
Client → [API Gateway with Rate Limiter] → Application Servers → Database

  • Pro: Single enforcement point, language-agnostic, can be configured without touching application code
  • Con: The gateway becomes a potential bottleneck and single point of failure; needs its own HA story

Option 3: Application Server Middleware

Rate limiting logic lives as middleware inside each application server, backed by a shared Redis instance.

plaintext
Client → Load Balancer → [App Server + Rate Limiter Middleware] → Database
                                      │
                                      ▼
                                Redis Cluster

  • Pro: Flexible, no extra network hop to a separate gateway
  • Con: Every app server needs the Redis client; logic is harder to change centrally

In practice: Most large companies use the API Gateway approach for external APIs and middleware for internal service-to-service rate limiting. In your interview, recommend the API Gateway approach and explain both.


The Five Rate Limiting Algorithms

This is the meat of the interview. Know all five, their trade-offs, and when to reach for each one.

1. Fixed Window Counter

How it works: Divide time into fixed windows (e.g., one-minute buckets). Keep a counter per user per window. Reject requests once the counter exceeds the limit.

plaintext
Window: [12:00:00 — 12:00:59]
User A counter: 0 → 1 → 2 → ... → 100 (limit reached, reject)
 
Window resets at: 12:01:00
User A counter: 0 (fresh start)

Redis implementation:

python
def is_allowed_fixed_window(user_id: str, limit: int, window_seconds: int) -> bool:
    now = int(time.time())
    window_key = f"ratelimit:{user_id}:{now // window_seconds}"
    
    count = redis.incr(window_key)
    if count == 1:
        # Set expiry only on first increment
        redis.expire(window_key, window_seconds * 2)
    
    return count <= limit

Pros:

  • Dead simple to implement and understand
  • Memory efficient: one counter per user per window
  • O(1) time complexity

Cons:

  • The boundary problem. A user can send 100 requests at 12:00:59 and another 100 at 12:01:01 — that's 200 requests in 2 seconds while technically respecting the "100 per minute" rule. This is the classic fixed window exploit.

When to use: Simple internal quotas where boundary gaming isn't a concern. Not suitable for external-facing APIs.
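The boundary exploit is easy to demonstrate with a toy in-memory fixed-window counter (a deterministic sketch with injected timestamps, not the Redis version above):

```python
from collections import defaultdict

def make_fixed_window(limit: int, window_seconds: int):
    """Toy in-memory fixed-window counter; `now` is injected for determinism."""
    counters = defaultdict(int)

    def is_allowed(user_id: str, now: float) -> bool:
        window = int(now // window_seconds)
        counters[(user_id, window)] += 1
        return counters[(user_id, window)] <= limit

    return is_allowed

is_allowed = make_fixed_window(limit=100, window_seconds=60)

# 100 requests at 12:00:59 (t=59s, last second of the window) — all allowed
burst1 = sum(is_allowed("user-a", 59.0) for _ in range(100))
# 100 more at 12:01:01 (t=61s) — a fresh window, so all allowed again
burst2 = sum(is_allowed("user-a", 61.0) for _ in range(100))

print(burst1 + burst2, "requests allowed within 2 seconds")
```

Both bursts pass in full: 200 requests land in 2 seconds while each window individually respects its 100-request limit.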


2. Sliding Window Log

How it works: Instead of a counter, store a sorted set of timestamps for every request. To check a new request, count how many timestamps fall within the last N seconds. If over the limit, reject.

python
def is_allowed_sliding_log(user_id: str, limit: int, window_seconds: int) -> bool:
    now = time.time()
    window_start = now - window_seconds
    key = f"ratelimit:log:{user_id}"
    
    # Atomic pipeline
    pipe = redis.pipeline()
    pipe.zremrangebyscore(key, 0, window_start)       # Remove old entries
    pipe.zcard(key)                                    # Count remaining
    pipe.zadd(key, {str(now): now})                    # Add current request
    pipe.expire(key, window_seconds)
    results = pipe.execute()
    
    count = results[1]  # Count taken before the zadd above
    # Note: rejected requests still get recorded and count against the window;
    # in production, use a unique member (timestamp + request ID) so two
    # requests with identical timestamps don't collide in the sorted set
    return count < limit

Pros:

  • Perfectly accurate — no boundary gaming possible
  • The sliding window is truly continuous

Cons:

  • Memory hungry. Each request timestamp is stored. At 1000 requests/minute per user, that's 1000 entries in Redis per user per window.
  • Not practical at extreme scale

When to use: Low-volume APIs where accuracy is paramount (e.g., financial transaction APIs). Not suitable for high-throughput systems.


3. Sliding Window Counter

How it works: This is the sweet spot. Instead of storing individual timestamps, keep counters for the current and previous windows, then use a weighted calculation to approximate the sliding window.

plaintext
If you're 75% through the current window:
  Estimated count = (previous_window_count × 0.25) + current_window_count

python
def is_allowed_sliding_counter(user_id: str, limit: int, window_seconds: int) -> bool:
    now = time.time()
    current_window = int(now // window_seconds)
    prev_window = current_window - 1
    elapsed_in_window = now % window_seconds
    weight = elapsed_in_window / window_seconds  # How far into current window we are
 
    curr_key = f"ratelimit:{user_id}:{current_window}"
    prev_key = f"ratelimit:{user_id}:{prev_window}"
    
    pipe = redis.pipeline()
    pipe.get(prev_key)
    pipe.get(curr_key)
    results = pipe.execute()
    
    prev_count = int(results[0] or 0)
    curr_count = int(results[1] or 0)
    
    estimated = (prev_count * (1 - weight)) + curr_count
    
    if estimated >= limit:
        return False
    
    # Note: this check-then-increment is not atomic across servers;
    # wrap the whole read-compute-write in a Lua script for production use
    redis.incr(curr_key)
    redis.expire(curr_key, window_seconds * 2)
    return True

Pros:

  • Accurate approximation (error rate is typically less than 1% for smooth traffic)
  • Memory efficient: only two counters per user
  • Eliminates the boundary spike problem

Cons:

  • Slightly complex to reason about
  • An approximation, not a guarantee (acceptable for most use cases)

When to use: The best general-purpose algorithm for external APIs. This is what Cloudflare uses at scale.


4. Token Bucket

How it works: Each user has a "bucket" with a maximum capacity of N tokens. Tokens are added at a fixed refill rate. Each request consumes one token. If the bucket is empty, the request is rejected.

plaintext
Bucket capacity: 100 tokens
Refill rate: 10 tokens/second
 
t=0:   Bucket = 100 tokens
t=0:   50 requests arrive → Bucket = 50 tokens (burst handled)
t=5s:  50 tokens refilled → Bucket = 100 tokens
t=5s:  120 requests arrive → 100 allowed, 20 rejected

Redis implementation using atomic Lua script (critical for correctness):

lua
-- token_bucket.lua
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])  -- tokens per second
local now = tonumber(ARGV[3])
 
local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now
 
-- Calculate tokens to add based on elapsed time
local elapsed = now - last_refill
local new_tokens = math.min(capacity, tokens + (elapsed * refill_rate))
 
if new_tokens >= 1 then
    redis.call('HMSET', key, 'tokens', new_tokens - 1, 'last_refill', now)
    redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) * 2)
    return 1  -- allowed
else
    return 0  -- rejected
end

Pros:

  • Naturally handles burst traffic — users can "save up" unused capacity
  • Smooth enforcement of long-term average rate
  • Two intuitive parameters: capacity (burst tolerance) and refill rate (sustained throughput)

Cons:

  • Two parameters make it slightly harder to reason about
  • Doesn't guarantee a perfectly smooth request rate

When to use: APIs that need to tolerate bursts. Stripe, for example, allows short bursts well above the sustained rate for legitimate payment processing peaks.
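The Lua script's refill arithmetic is worth internalising, so here is the same logic mirrored as a plain-Python sketch — in-memory and single-process, useful for unit-testing the maths, not a replacement for the atomic Lua version:

```python
class TokenBucket:
    """In-memory token bucket mirroring the Lua script's refill math."""

    def __init__(self, capacity: float, refill_rate: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_rate = refill_rate   # tokens per second
        self.tokens = capacity           # bucket starts full
        self.last_refill = now

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at capacity
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=100, refill_rate=10)
allowed_at_t0 = sum(bucket.allow(0.0) for _ in range(120))  # burst of 120: 100 pass
allowed_at_t5 = sum(bucket.allow(5.0) for _ in range(120))  # 50 tokens refilled by t=5
```

Note that rejections don't consume tokens, so after the first burst drains the bucket, exactly `elapsed × refill_rate` requests succeed on the next burst.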


5. Leaky Bucket

How it works: Requests go into a FIFO queue (the "bucket"). The queue drains at a fixed constant rate. If the bucket is full when a new request arrives, that request is dropped immediately.

plaintext
Queue capacity: 100 requests
Drain rate: 10 requests/second
 
Requests enter at any rate → processed at exactly 10/sec → excess dropped

Pros:

  • Output rate is perfectly constant and predictable
  • Great for protecting downstream services from traffic spikes

Cons:

  • Requests can wait in the queue, adding latency — problematic for real-time APIs
  • Doesn't allow any bursting; a sudden spike of legitimate traffic gets dropped

When to use: Rate-limiting outbound calls to a third-party service that can't handle bursts (e.g., an SMS gateway with strict throughput limits). Less common for inbound API rate limiting.
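Since no snippet accompanies this algorithm, here's a minimal in-memory sketch: a bounded queue that drains at a fixed rate, with time injected for determinism (a toy model, not production code):

```python
from collections import deque

class LeakyBucket:
    """Toy leaky bucket: requests queue up and drain at a fixed constant rate."""

    def __init__(self, capacity: int, drain_rate: float, now: float = 0.0):
        self.capacity = capacity
        self.drain_rate = drain_rate     # requests processed per second
        self.queue = deque()
        self.last_drain = now

    def _drain(self, now: float):
        # Remove as many queued requests as the drain rate allows
        drained = int((now - self.last_drain) * self.drain_rate)
        if drained > 0:
            for _ in range(min(drained, len(self.queue))):
                self.queue.popleft()
            self.last_drain = now

    def offer(self, request, now: float) -> bool:
        self._drain(now)
        if len(self.queue) >= self.capacity:
            return False                 # bucket full — drop the request
        self.queue.append(request)
        return True

bucket = LeakyBucket(capacity=100, drain_rate=10)
accepted = sum(bucket.offer(i, now=0.0) for i in range(150))  # 100 queued, 50 dropped
later = bucket.offer("x", now=1.0)                            # 10 drained, room again
```

The spike of 150 simultaneous requests fills the queue and drops the excess; one second later, 10 slots have drained and new requests are accepted again.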


Algorithm Comparison at a Glance

plaintext
Algorithm              | Burst Handling | Memory   | Accuracy | Complexity
-----------------------|----------------|----------|----------|------------
Fixed Window Counter   | Yes (at edges) | Low      | Low      | Very Low
Sliding Window Log     | No             | High     | Perfect  | Medium
Sliding Window Counter | Partial        | Low      | High     | Medium
Token Bucket           | Yes            | Low      | High     | Medium
Leaky Bucket           | No             | Medium   | Perfect  | Medium

Interview tip: When asked which algorithm to choose, say "It depends on the use case." Then demonstrate you know the trade-offs. For a general external-facing API, the sliding window counter or token bucket are usually the right answer. Name them and explain why.


High-Level Architecture

Here's a distributed rate limiter that can handle millions of requests per second:

plaintext
                          ┌───────────────────────┐
                          │   Rate Limit Config   │
                          │   (DB / Config Store) │
                          └───────────┬───────────┘
                                      │ config reload
              ┌───────────────────────┼───────────────────────┐
              │                       │                       │
    ┌─────────▼─────────┐   ┌─────────▼─────────┐   ┌─────────▼─────────┐
    │   App Server 1    │   │   App Server 2    │   │   App Server 3    │
    │  + RL Middleware  │   │  + RL Middleware  │   │  + RL Middleware  │
    └─────────┬─────────┘   └─────────┬─────────┘   └─────────┬─────────┘
              │                       │                       │
              └───────────────────────┼───────────────────────┘
                                      │ atomic Redis ops
                          ┌───────────▼───────────┐
                          │     Redis Cluster     │
                          │ (Primary + Replicas)  │
                          └───────────────────────┘

Key Components

Rate Limit Middleware

  • Runs in the request path on every app server
  • Extracts the limiting key (user ID, API key, or IP) from the request
  • Calls Redis atomically to check and update the counter
  • Returns HTTP 429 with appropriate headers if the limit is exceeded
  • Adds rate limit headers to all responses (allowed and rejected)

Redis Cluster

  • Stores all rate limiting state (counters, token buckets, timestamps)
  • Chosen for its sub-millisecond latency and native atomic operations (INCR, EXPIRE, Lua scripts)
  • Cluster mode distributes keys across multiple nodes for horizontal scaling
  • Replication ensures the cluster survives node failures

Rate Limit Config Store

  • Stores rules like: {user_tier: "free", endpoint: "/api/search", limit: 100, window: 60}
  • Rules are loaded into application memory at startup and refreshed periodically
  • Keeps rule changes out of the hot path

Handling Distributed Race Conditions

This is where most interview answers fall apart. When you have 50 app servers all reading and writing to Redis concurrently, naive implementations break.

The Problem

plaintext
Server 1: GET counter for user A → 99
Server 2: GET counter for user A → 99
Server 1: counter < 100, allow request. SET counter = 100
Server 2: counter < 100, allow request. SET counter = 100
Result: both requests allowed — 101 requests got through against a limit of 100.

The Solutions

Solution 1: Redis INCR + EXPIRE (Fixed Window)

INCR is atomic — it reads and increments in a single operation. No two clients can read the same value.

python
count = redis.incr(key)  # atomic: read + increment + write
if count == 1:
    redis.expire(key, window_seconds)
return count <= limit

Solution 2: Lua Scripts (Token Bucket / Sliding Window Counter)

For multi-step operations (read → compute → write), use a Lua script. Redis executes Lua scripts atomically — the entire script runs as a single transaction.

lua
-- All reads and writes in one atomic block
local tokens = redis.call('GET', KEYS[1])
-- ... compute ...
redis.call('SET', KEYS[1], new_value)
-- No other command can interleave here

Solution 3: Redis Transactions (WATCH/MULTI/EXEC)

Redis optimistic locking — if the key changes between WATCH and EXEC, the transaction aborts and you retry. Works, but retries under high contention can add latency. Lua scripts are generally preferred.

What to say in an interview: Lead with Lua scripts. They're the cleanest, most production-proven approach. Mention Redis's single-threaded command execution as the reason they work.


Response Headers — Don't Forget These

A well-designed rate limiter communicates clearly with clients. Include these headers on every response:

plaintext
# For allowed requests
X-RateLimit-Limit: 1000          # The configured limit
X-RateLimit-Remaining: 742       # How many requests remain in this window
X-RateLimit-Reset: 1696118400    # Unix timestamp when the window resets
 
# For throttled requests (HTTP 429)
HTTP/1.1 429 Too Many Requests
Retry-After: 47                  # Seconds until the client can retry
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1696118400

These headers matter because they let API clients implement exponential backoff correctly instead of hammering your service.
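A small helper keeps these headers consistent across the allowed and throttled paths. A sketch (the function name and dict-based shape are illustrative, not from any particular framework):

```python
import time

def rate_limit_headers(limit: int, remaining: int, reset_at: int) -> dict:
    """Build the X-RateLimit-* headers for any response (allowed or throttled)."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_at),
    }
    if remaining <= 0:
        # Throttled: tell the client exactly how long to back off
        headers["Retry-After"] = str(max(0, reset_at - int(time.time())))
    return headers

headers = rate_limit_headers(limit=1000, remaining=742, reset_at=1696118400)
```

Attaching this to every response — not just 429s — is what lets well-behaved clients slow down before they hit the limit.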


Rate Limiting Rules & Policies

In a real system, rate limits aren't one-size-fits-all. You need a rules engine.

Example Rule Schema

json
{
  "rules": [
    {
      "id": "free-tier-global",
      "match": {"user_tier": "free"},
      "limit": 1000,
      "window_seconds": 3600,
      "algorithm": "sliding_window_counter"
    },
    {
      "id": "premium-search-endpoint",
      "match": {"user_tier": "premium", "endpoint": "/api/search"},
      "limit": 500,
      "window_seconds": 60,
      "algorithm": "token_bucket",
      "bucket_capacity": 1000,
      "refill_rate": 8.33
    },
    {
      "id": "unauthenticated-ip",
      "match": {"auth": false},
      "limit": 20,
      "window_seconds": 60,
      "algorithm": "fixed_window"
    }
  ]
}

Rule Evaluation Order

Rules are evaluated from most specific to least specific:

  1. User ID + Endpoint (most specific)
  2. User ID only
  3. API Key
  4. IP Address
  5. Global fallback (least specific)

The first matching rule wins.
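The most-specific-first evaluation can be sketched as a simple match loop. The rule and request shapes below are modelled on the JSON schema above; treating "more match conditions = more specific" is one reasonable ordering heuristic, not the only one:

```python
def find_rule(rules, request):
    """Return the first rule whose `match` conditions all hold for the request.
    Rules with more match conditions are treated as more specific and win."""
    for rule in sorted(rules, key=lambda r: len(r["match"]), reverse=True):
        if all(request.get(k) == v for k, v in rule["match"].items()):
            return rule
    return None  # No rule matched — fall back to a global default

rules = [
    {"id": "free-tier-global", "match": {"user_tier": "free"}, "limit": 1000},
    {"id": "premium-search",
     "match": {"user_tier": "premium", "endpoint": "/api/search"}, "limit": 500},
]

rule = find_rule(rules, {"user_tier": "premium", "endpoint": "/api/search"})
# premium-search wins: two match conditions beat one
```

In production you'd pre-sort the rules once at config-load time rather than on every request.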


Failing Gracefully

Your rate limiter will eventually have a Redis outage. Decide your failure policy upfront.

Fail-Open

If Redis is unavailable, allow all requests through.

python
try:
    allowed = check_rate_limit(user_id)
except RedisConnectionError:
    allowed = True  # Fail open — don't block legitimate traffic
    log.error("Rate limiter unavailable, failing open")

  • Pro: Service stays available; no false rejections
  • Con: Abuse can slip through during the outage window

Fail-Closed

If Redis is unavailable, reject all requests.

python
try:
    allowed = check_rate_limit(user_id)
except RedisConnectionError:
    allowed = False  # Fail closed — reject everything
    log.error("Rate limiter unavailable, failing closed")

  • Pro: Consistent enforcement; protects downstream systems
  • Con: You're taking your own service down during a Redis failure

In practice: Most external-facing APIs fail-open and rely on circuit breakers and monitoring to catch the outage quickly. Financial or security-sensitive endpoints sometimes fail-closed.


Scaling the Rate Limiter

Redis Cluster Sharding

Rate limit keys are naturally distributed by user ID. Redis Cluster hashes keys to one of 16,384 hash slots, spread across nodes. This gives you horizontal scaling with no hotspots (unless you have a single "super user" sending millions of requests — handle that at the application layer with a per-user dedicated key strategy).
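The key-to-slot mapping Redis Cluster uses is `CRC16(key) mod 16384`, with the CRC-16/XMODEM checksum variant. A sketch (note that real Redis also honours `{hash tag}` substrings inside keys, which is omitted here):

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC-16/XMODEM (poly 0x1021, init 0) — the checksum Redis Cluster uses."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def hash_slot(key: str) -> int:
    # Each key maps deterministically to one of 16,384 slots
    return crc16_xmodem(key.encode()) % 16384

slot = hash_slot("ratelimit:user:12345")
```

Because user IDs are high-cardinality, rate-limit keys spread evenly across slots — which is exactly why sharding by user ID avoids hotspots.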

Local Cache Layer (For Extreme Scale)

At very high throughput (millions of req/sec), even Redis can be a bottleneck. Add a local in-memory counter on each app server that syncs to Redis every 100ms.

plaintext
Trade-off: A user could burst to N × limit for up to 100ms
(where N = number of app servers) before global enforcement kicks in.

This is the approach used by systems handling truly massive scale. Mention it as an optimization, along with the accuracy trade-off.
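The local-counter idea can be sketched as a per-process buffer that flushes its delta to the shared store on an interval. The `flush` callback below is a stand-in for a batched Redis `INCRBY`, and time is injected for determinism:

```python
from collections import defaultdict

class LocalCounter:
    """Per-process counter that periodically flushes deltas to shared storage."""

    def __init__(self, flush, sync_interval: float = 0.1, now: float = 0.0):
        self.flush = flush               # e.g. lambda key, n: redis.incrby(key, n)
        self.sync_interval = sync_interval
        self.pending = defaultdict(int)  # local, unsynced increments
        self.last_sync = now

    def incr(self, key: str, now: float):
        self.pending[key] += 1
        if now - self.last_sync >= self.sync_interval:
            for k, delta in self.pending.items():
                self.flush(k, delta)     # one batched write per key
            self.pending.clear()
            self.last_sync = now

shared = defaultdict(int)                # stand-in for the Redis cluster

def flush_to_shared(key, delta):
    shared[key] += delta

counter = LocalCounter(flush=flush_to_shared)
for _ in range(50):
    counter.incr("user:a", now=0.05)     # buffered locally, not yet flushed
counter.incr("user:a", now=0.15)         # interval elapsed → batch-flush all 51
```

One Redis write per key per interval instead of one per request — that's the whole optimization, bought at the cost of the brief over-limit window described above.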

Read Replicas

For the rate limit config store (not the counters), use Redis replicas to spread read load across your app servers without hitting the primary for every rule lookup.


Monitoring and Alerting

Key Metrics to Track

Rate Limiter Health

  • Redis command latency (p50, p95, p99) — alert if p99 exceeds 5ms
  • Redis memory utilization — alert above 80%
  • Rate limiter middleware error rate — alert above 0.01%

Traffic Patterns

  • Requests throttled per minute (absolute and percentage of total)
  • Top throttled users / IPs (to identify abuse or misconfiguration)
  • 429 response rate per endpoint

Business Metrics

  • False positive rate (legitimate traffic being throttled) — requires user feedback loops
  • Rate limit breach patterns (detect coordinated attacks)

Common Interview Follow-ups

"How would you rate limit GraphQL where a single query can be cheap or very expensive?"

GraphQL makes request-count-based limits almost useless — one query can fetch thousands of records while another fetches one field. The solution is cost-based rate limiting, where each query is assigned a complexity score based on the fields and depth requested, and the rate limit is enforced against cumulative complexity rather than raw request count. GitHub's GraphQL API does exactly this.
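A toy cost scorer over a dict-shaped query gives the flavour. The query shape, field cost of 1, and depth multiplier are all illustrative assumptions — real implementations walk the parsed GraphQL AST and assign per-field costs:

```python
def query_cost(selection: dict, depth_multiplier: int = 2) -> int:
    """Score a query: each field costs 1; nested selections cost more with depth."""
    cost = 0
    for field, sub in selection.items():
        cost += 1
        if isinstance(sub, dict) and sub:
            # Deeper nesting multiplies the cost of the subtree
            cost += depth_multiplier * query_cost(sub, depth_multiplier)
    return cost

cheap = query_cost({"viewer": {"login": None}})
expensive = query_cost({"repos": {"issues": {"comments": {"body": None}}}})
```

The rate limiter then deducts `query_cost(...)` from the user's budget per window instead of deducting 1 per request.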

"How do you prevent a user from bypassing IP-based rate limiting?"

Sophisticated attackers rotate IPs or use proxy networks. Layered defences work better: combine IP-based limits with user-account-based limits, API key limits, and device fingerprinting. For the most abusive actors, use CAPTCHA challenges or require account verification.

"What if different regions have very different traffic patterns?"

Consider geo-distributed rate limiting where each region enforces limits independently against a regional Redis cluster, with periodic synchronisation to a global counter. This reduces cross-region latency at the cost of brief windows where a user could exceed their global limit by spreading requests across regions. Mention this as a trade-off and let the interviewer decide which direction to explore.

"How do you handle clock skew in a distributed system?"

Token bucket and sliding window calculations depend on accurate timestamps. If servers have clock skew, refill calculations become incorrect. Use Redis TIME command to get the Redis server's timestamp (single source of truth) rather than the application server's local clock. This eliminates skew issues entirely.
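With redis-py, the server clock is available as `r.time()`, which returns a `(seconds, microseconds)` tuple; the conversion to a float timestamp is trivial. A sketch (the live `redis.Redis` connection is assumed and shown only in the docstring):

```python
def redis_now(seconds: int, microseconds: int) -> float:
    """Convert a Redis TIME reply into a float timestamp.

    Usage with redis-py: redis_now(*r.time()) — the Redis server's clock
    becomes the single source of truth for every rate-limit calculation,
    regardless of skew between application servers.
    """
    return seconds + microseconds / 1_000_000

ts = redis_now(1696118400, 500000)  # → 1696118400.5
```

Inside a Lua script you'd call `redis.call('TIME')` directly instead of passing `now` in from the application, with the same effect.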

"What's the difference between rate limiting and throttling?"

Rate limiting is hard enforcement — requests over the limit are rejected immediately. Throttling is softer — requests are queued and delayed rather than dropped. Throttling is appropriate when brief delays are acceptable and you'd rather slow down a user than reject them outright. Leaky bucket is a throttling approach; token bucket and sliding window are rate limiting approaches.


Quick Interview Checklist

Before wrapping up your answer, make sure you've covered these:

  • ✅ Clarified whether this is inbound API rate limiting, outbound call throttling, or both
  • ✅ Chosen an algorithm and explained why (not just what)
  • ✅ Addressed distributed correctness (race conditions, Lua scripts)
  • ✅ Mentioned Redis as the storage backend and why (atomic ops, sub-millisecond latency)
  • ✅ Discussed the failure mode (fail-open vs. fail-closed)
  • ✅ Mentioned response headers (429, Retry-After, X-RateLimit-*)
  • ✅ Covered how rules are configured and evaluated
  • ✅ Talked about monitoring and what metrics matter

Conclusion

Rate limiters look deceptively simple until you start thinking about distributed systems. The real challenge isn't picking an algorithm — it's making sure the algorithm runs correctly when you have 50 app servers, a Redis cluster, and millions of requests hitting you simultaneously.

The key design decisions in a nutshell:

  1. Algorithm: Sliding window counter or token bucket for most external APIs. Fixed window if simplicity is paramount and boundary gaming isn't a concern.
  2. Storage: Redis, always. Its atomic operations (INCR, Lua scripts) are what make distributed rate limiting tractable.
  3. Race conditions: Use Lua scripts. They execute atomically and eliminate the read-compute-write race.
  4. Failure mode: Fail-open for most APIs; fail-closed for sensitive endpoints.
  5. Communication: Always return Retry-After and X-RateLimit-* headers so clients can behave well.

Rate limiters are one of those problems that reward engineers who understand distributed systems fundamentals. Get the basics right, then show the interviewer you know the edge cases — that combination is what separates a good answer from a great one.


Preparing for system design interviews? Mockingly.ai is built specifically for engineers who want to get serious about system design — whether it's backend, frontend, or mobile. Practice with realistic interview simulations and get the feedback loop you actually need before the real thing.

Practice "Design a Rate Limiter" with AI

Reading is great, but practicing is how you actually get the offer. Get real-time AI feedback on your system design answers.

Practice system design with AI