Tech · 6 min read

Designing Rate Limiters: Token Bucket, Leaky Bucket, and Sliding Windows

How rate limiters actually work — token bucket, leaky bucket, fixed and sliding windows — with the trade-offs that decide which one belongs in front of your API.

By Jarviix Engineering · Apr 19, 2026

A rate limiter is one of those tiny pieces of infrastructure that prevents an enormous number of bad days. Without one, a buggy client, a runaway script, or a credential-stuffing bot can swamp your service in minutes. With one, the same actor gets a polite 429 Too Many Requests and your service stays up.

This post walks through the four algorithms you'll see in real systems, when each one fits, and the trade-offs that decide which one belongs in front of your API.

What a rate limiter actually does

A rate limiter answers one question: for this client, in this window, have they exceeded the budget?

Three sub-decisions hide inside that question:

  1. What's the key? Per IP, per user, per API token, per route, or some combination.
  2. What's the budget? "100 requests per minute" — but per minute how?
  3. What happens when the budget runs out? Reject (429), queue, or slow down.

The algorithm choice is mostly about how you answer #2.

Algorithm 1: Fixed window

The simplest one. You count requests in clock-aligned windows — "between 12:00:00 and 12:00:59, this user made N requests".

12:00:00 - 12:01:00 → 100 requests max
12:01:00 - 12:02:00 → 100 requests max

Pros. Trivially simple. One counter per key, reset every window. Easy to implement on Redis with INCR + EXPIRE.

Cons. Burst at boundaries. A client can make 100 requests at 12:00:59 and another 100 at 12:01:00 — 200 requests in two seconds, all within budget. For a "100/min" limit, that's a 2× violation that the algorithm allows by design.

Reach for it when simplicity matters more than precision and you can absorb 2× bursts at window boundaries.

Algorithm 2: Sliding window log

Store a timestamp for every request the client made. On each new request, drop entries older than the window and count the rest.

import time
import uuid

def allow(key, limit, window):
    now = time.time()
    # Drop entries older than the window, then count what's left.
    redis.zremrangebyscore(key, 0, now - window)
    if redis.zcard(key) >= limit:
        return False
    # Unique member per request; the score is the timestamp.
    redis.zadd(key, {str(uuid.uuid4()): now})
    redis.expire(key, window)
    return True

Pros. Mathematically exact. No boundary bursts.

Cons. Memory. Every request is a stored entry. For a 1000-req/min limit on a million users, that's a billion entries floating around at peak. Costly.

Reach for it when the limits are small, the request rate is low, or the precision is genuinely required (financial APIs, voting systems).

Algorithm 3: Sliding window counter

A clever middle ground. Keep two fixed-window counters (current and previous), and weight the previous one by how much of the current window has elapsed.

elapsed_in_current = 0.4   # we're 40% into the current minute
estimated = current_count + previous_count * (1 - 0.4)

This approximates a true sliding window without storing every request. It's the algorithm Cloudflare made famous in their public write-up.

Pros. Constant memory per key. Good accuracy — typically within a few percent of the true rate.

Cons. Slightly more code. Still a small approximation error during very bursty traffic.

Reach for it when you want sliding-window behavior at fixed-window cost. This is the safe default for most production rate limiters.

Algorithm 4: Token bucket

A different model entirely. Each client has a bucket of tokens. Every request consumes one. Tokens refill at a fixed rate up to a maximum.

bucket_size = 100   # max burst
refill_rate = 10/s  # sustained rate

Empty bucket → reject. Full bucket → 100 instant requests, then throttle to 10/s.

Pros. Models real-world bursty traffic well. Allows short bursts up to bucket size, then enforces a sustained rate. Stripe, AWS, and most cloud APIs use variants of this.

Cons. Two parameters to tune (size + refill rate) instead of one. Slightly less intuitive when explaining limits to users.

Reach for it when you want to allow short bursts (which most real APIs do — humans tap "refresh" three times) while still bounding sustained throughput.

Algorithm 5: Leaky bucket

Cousin of the token bucket. Requests go into a queue at variable rate; the queue drains at a fixed rate. Overflow → drop.

queue_size = 100
drain_rate = 10/s

The output rate is always smooth. Burstiness is absorbed by the queue.

Pros. Smooths traffic to a fixed downstream rate — useful when the limiter exists to protect a slow downstream system.

Cons. Adds latency (requests wait in the queue). Less natural for HTTP APIs where the client wants either "yes" or "no", not "wait".

Reach for it when the rate limiter is shaping traffic for a downstream system (a third-party API with a strict QPS cap, a database, a message broker), not throttling abusive clients.

Where to put the limiter

The cheapest rejected request is the one your application servers never see.

  • At the edge (Cloudflare, AWS WAF, your reverse proxy). Catches the worst abuse before it costs you anything.
  • At the API gateway (Kong, Tyk, Envoy). Per-key budgets enforced before requests fan into your services.
  • Inside the application. Last line of defense, useful for fine-grained per-user-per-resource rules, expensive operations, login throttles.

Most production systems do all three. The edge sheds volumetric attacks; the gateway enforces plan limits; the app handles fine-grained business rules.

Distributed limiters: the Redis pattern

The textbook implementation uses Redis with atomic operations. A fixed-window counter — the building block for the sliding-window counter — fits in a few lines of Lua:

-- KEYS[1] = "ratelimit:user:42:minute:N"
-- ARGV[1] = limit, ARGV[2] = ttl
local count = redis.call("INCR", KEYS[1])
if count == 1 then redis.call("EXPIRE", KEYS[1], ARGV[2]) end
if count > tonumber(ARGV[1]) then return 0 end
return 1

Running this as a Lua script makes the whole read-modify-write atomic. Without atomicity, two simultaneous requests can both pass a "below limit" check and both succeed when only one should.

For very high request rates, Redis itself can become the bottleneck. The escape hatches:

  • Local-first checks. Each app instance keeps a local counter and only syncs occasionally to Redis. Cheap; allows some over-budget bursts proportional to the number of instances.
  • Approximate counting. HyperLogLog or count-min sketches for very high cardinality.
  • Sharded Redis. Split the keyspace.
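The local-first trade-off is easy to see in code. A minimal sketch, where `fetch_global` and `push_delta` are hypothetical stand-ins for the Redis round trips and one limiter instance tracks one key:

```python
class LocalFirstLimiter:
    """Count locally; push to the shared store every `sync_every` requests.
    Each instance can overshoot by up to `sync_every - 1` requests, so the
    global overshoot is proportional to the number of instances."""

    def __init__(self, limit, sync_every, fetch_global, push_delta):
        self.limit = limit
        self.sync_every = sync_every
        self.fetch_global = fetch_global
        self.push_delta = push_delta
        self.global_count = 0   # last known shared count
        self.local_delta = 0    # requests not yet pushed

    def allow(self, key):
        if self.global_count + self.local_delta >= self.limit:
            return False
        self.local_delta += 1
        if self.local_delta >= self.sync_every:
            self.push_delta(key, self.local_delta)
            self.global_count = self.fetch_global(key)
            self.local_delta = 0
        return True
```

One Redis round trip per `sync_every` requests instead of one per request — the accuracy you give up is bounded and tunable.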

What to return

A good rate limiter is honest with clients. Beyond 429, return:

  • Retry-After: 30 — seconds to wait.
  • X-RateLimit-Limit: 100 — the budget.
  • X-RateLimit-Remaining: 0 — how much is left.
  • X-RateLimit-Reset: 1700000000 — when the budget refreshes.

This lets well-behaved clients back off cleanly. The script kiddies will ignore it, but the legitimate integrations you actually care about will respect it.
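Building that header set from limiter state is a few lines. A sketch (the function name is illustrative; the header names follow the de-facto `X-RateLimit-*` convention from the list above):

```python
import time

def rate_limit_headers(limit: int, remaining: int, reset_at: float) -> dict:
    """Headers for a rate-limited response; reset_at is a Unix timestamp."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(int(reset_at)),
    }
    if remaining <= 0:
        # Only rejected (429) responses need Retry-After.
        headers["Retry-After"] = str(max(1, int(reset_at - time.time())))
    return headers
```

Send the `X-RateLimit-*` trio on every response, not just the 429 — clients that watch `Remaining` can back off before they ever hit the wall.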

Rate limiting is one of those topics that connects to almost everything else. System design basics covers the building blocks the limiter sits between, and the HLD writeup of a rate limiter walks through an entire interview-grade design end to end.

Frequently asked questions

Should I rate limit per user or per IP?

Both. Per IP catches anonymous abuse and credential stuffing; per authenticated user catches plan limits and abusive accounts. If you only do one, do per IP for unauthenticated traffic and per user for authenticated traffic.

What's the difference between throttling and rate limiting?

Throttling slows clients down (queues, delays); rate limiting rejects them once they exceed a budget. They're often combined — slow them down at the soft limit, reject them at the hard limit.

Where should the limiter live?

As far upstream as possible — at your edge proxy or API gateway. The cheapest rejected request is the one your application servers never see.
