
API Rate Limiting Strategies: Token Bucket, Leaky Bucket, and Sliding Window

Rate limiting protects APIs from abuse and overload. The major algorithms, when each is appropriate, and how to implement them in distributed systems.

By Jarviix Engineering · Apr 19, 2026


Rate limiting is the most important defensive feature most APIs lack. Without it, a single misbehaving client (intentional or accidental) can saturate your servers, exhaust connection pools, and degrade service for everyone. Production-quality APIs always rate limit; well-designed APIs do it transparently and predictably.

This post covers the major rate limiting algorithms, when each is appropriate, and the implementation considerations that matter in distributed systems.

Why rate limit

Protection from abuse

Misbehaving clients (bots, misconfigured scripts, malicious actors) can generate orders of magnitude more traffic than normal users. Without limits, they degrade service for everyone.

Cost control

Many APIs have downstream costs (third-party API calls, database queries, AI inference). Rate limiting prevents runaway costs from a single client.

Fairness

Without limits, "loud" users get more service than quiet ones. Rate limiting ensures a baseline service level for all users.

Capacity planning

Predictable per-client limits make total capacity calculations possible.

Compliance

Some compliance regimes effectively require request-limiting controls on sensitive endpoints; PCI DSS, for example, mandates lockout after repeated failed login attempts.

Rate limiting algorithms

Fixed Window

Count requests in fixed time windows (e.g., per minute). When the window rolls over, the counter resets to zero.

Implementation: Counter per key per window. Increment on request; reject if over limit.

Pros: Simple, low memory. Cons: Burst at window boundaries — client could send 2x limit in 1 second across the boundary.

Example: 100 requests/minute. Client sends 100 at 12:00:59 and another 100 at 12:01:00 — 200 requests in 1 second, both within limit.

When to use: Coarse-grained limits where boundary bursts don't matter.
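The counter logic fits in a few lines. A minimal in-memory sketch (the class name and the injectable `now` clock are ours, for illustration; a real deployment would keep the counters in a shared store):

```python
import time

class FixedWindowLimiter:
    """Fixed window: at most `limit` requests per `window_s`-second window."""

    def __init__(self, limit, window_s):
        self.limit = limit
        self.window_s = window_s
        self.counters = {}  # (key, window_number) -> request count

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        window = int(now // self.window_s)  # which fixed window `now` falls in
        count = self.counters.get((key, window), 0)
        if count >= self.limit:
            return False
        self.counters[(key, window)] = count + 1
        return True
```

Note that the boundary burst from the example above falls out directly: a client exhausting the limit at the very end of one window gets a fresh counter the moment the next window begins.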

Sliding Window Log

Maintain a log of timestamps for each request; count requests in the last N seconds.

Implementation: Sorted set or list of timestamps. Remove old entries; count remaining; reject if too many.

Pros: Most accurate; truly enforces "X requests per Y seconds." Cons: Memory intensive (one entry per request); operations grow linearly with request rate.

When to use: Precision is critical; request rates are moderate.
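A sketch of the log approach, using a deque of timestamps per key (names and the injectable clock are ours, for illustration):

```python
import time
from collections import deque

class SlidingWindowLog:
    """Exact limit: at most `limit` requests in any trailing `window_s` seconds."""

    def __init__(self, limit, window_s):
        self.limit = limit
        self.window_s = window_s
        self.logs = {}  # key -> deque of request timestamps

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        log = self.logs.setdefault(key, deque())
        # Evict timestamps that have fallen out of the trailing window.
        while log and log[0] <= now - self.window_s:
            log.popleft()
        if len(log) >= self.limit:
            return False
        log.append(now)
        return True
```

The memory cost is visible here: one stored timestamp per accepted request, which is exactly why this approach struggles at high request rates.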

Sliding Window Counter

Approximation of sliding window log. Maintain counters for current and previous fixed window; weighted average based on time within current window.

Implementation: Two counters per key. Compute approximate count = (previous_count × overlap_ratio) + current_count.

Pros: Memory efficient like fixed window; smoother than fixed window. Cons: Approximation; can over-count or under-count slightly.

When to use: Best general-purpose choice for HTTP APIs. Used by many production systems.
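The weighted-average formula above can be sketched as follows (a simplified illustration; class and variable names are ours):

```python
import time

class SlidingWindowCounter:
    """Approximate sliding window: two fixed-window counters per key."""

    def __init__(self, limit, window_s):
        self.limit = limit
        self.window_s = window_s
        self.counts = {}  # (key, window_number) -> count

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        window = int(now // self.window_s)
        elapsed = (now % self.window_s) / self.window_s  # fraction into current window
        prev = self.counts.get((key, window - 1), 0)
        cur = self.counts.get((key, window), 0)
        # Weight the previous window by how much of it still overlaps
        # the trailing window_s seconds.
        estimated = prev * (1 - elapsed) + cur
        if estimated >= self.limit:
            return False
        self.counts[(key, window)] = cur + 1
        return True
```

The approximation assumes requests in the previous window were evenly spread; bursty clients can be slightly over- or under-counted, which is the trade-off the section above describes.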

Token Bucket

Tokens are added to a bucket at a fixed rate; each request consumes a token. If the bucket is empty, the request is rejected (or throttled).

Implementation: Per-key state: current token count and last refill time. On request: refill tokens based on elapsed time, deduct one if available.

Parameters:

  • Bucket size (max burst)
  • Refill rate (sustained throughput)

Pros: Allows controlled bursts; smooths traffic over time; flexible parameters. Cons: Slightly more complex; requires per-key state.

When to use: When burst tolerance is desired; most user-facing APIs.
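The lazy-refill implementation described above is compact; a sketch (names and the injectable clock are ours, for illustration):

```python
import time

class TokenBucket:
    """Burst up to `capacity` requests; sustain `refill_rate` requests/second."""

    def __init__(self, capacity, refill_rate, now=None):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)  # start full: full burst available
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill lazily based on elapsed time, capped at bucket size.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

The two parameters map directly onto the list above: `capacity` bounds the burst, `refill_rate` bounds sustained throughput.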

Leaky Bucket

Requests enter a queue and are processed at a fixed rate. If the queue is full, new requests are rejected.

Implementation: Queue with fixed leak rate.

Pros: Smooths output rate completely; useful for downstream protection. Cons: Adds latency (requests wait in queue); doesn't allow bursts.

When to use: Protecting downstream services that need very smooth input rate.
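A sketch of the admission side of a leaky bucket (names and the explicit clock are ours; in a real system a worker would drain the queue at the leak rate, which is where the smoothing and the added latency come from):

```python
from collections import deque

class LeakyBucket:
    """Bounded queue drained at a fixed rate; a full queue rejects new requests."""

    def __init__(self, capacity, leak_rate, now=0.0):
        self.capacity = capacity    # max queued requests
        self.leak_rate = leak_rate  # requests processed per second
        self.queue = deque()
        self.last = now
        self.credit = 0.0           # fractional leak carried between calls

    def _leak(self, now):
        # Drain the queue at the fixed rate for however much time has passed.
        self.credit += (now - self.last) * self.leak_rate
        self.last = now
        while self.queue and self.credit >= 1:
            self.queue.popleft()
            self.credit -= 1
        if not self.queue:
            self.credit = 0.0  # don't bank leak capacity while idle

    def offer(self, request, now):
        self._leak(now)
        if len(self.queue) >= self.capacity:
            return False  # bucket full: reject
        self.queue.append(request)
        return True
```

Compare with the token bucket: here the output rate never exceeds `leak_rate`, which is exactly the property a fragile downstream service wants.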

Choosing parameters

Limit value

Should be high enough that legitimate users never hit it; low enough to prevent abuse.

  • User-facing APIs (authenticated users): 60-300 requests/minute typical
  • Server-to-server APIs: 1,000-10,000 requests/minute typical
  • Public/anonymous endpoints: 10-100 requests/minute typical

Window size

  • Per-second: prevents fine-grained burst; for very sensitive endpoints
  • Per-minute: standard for most APIs
  • Per-hour: for usage-based billing
  • Per-day: for monthly quota management

Use multiple windows simultaneously: 100/minute AND 5000/hour, for example.
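Composing windows is just an AND over independent limiters. A sketch with a minimal fixed-window counter (names and limits hypothetical):

```python
class FixedWindow:
    """Minimal fixed-window counter, enough to demonstrate composition."""

    def __init__(self, limit, window_s):
        self.limit, self.window_s, self.counts = limit, window_s, {}

    def allow(self, key, now):
        w = int(now // self.window_s)
        c = self.counts.get((key, w), 0)
        if c >= self.limit:
            return False
        self.counts[(key, w)] = c + 1
        return True

minute = FixedWindow(limit=100, window_s=60)
hour = FixedWindow(limit=5000, window_s=3600)

def allow(key, now):
    # A request must pass every window. Caveat: all() short-circuits, and a
    # pass on the minute limiter consumes its quota even if the hour limiter
    # then rejects; production code may want check-then-commit semantics.
    return all(lim.allow(key, now) for lim in (minute, hour))
```

The caveat in the comment is worth taking seriously: naive composition can burn quota in one window on requests another window rejects.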

Per-what?

  • Per IP: easy but defeated by NAT (multiple users behind one IP) and proxies
  • Per API key: standard for B2B APIs; clean isolation
  • Per user account: standard for B2C APIs
  • Per endpoint × per user: granular control over expensive endpoints

Most production APIs combine several: a coarse per-IP limit globally, plus per-key limits on specific endpoints.

Distributed rate limiting

Single-server rate limiting is easy. Distributed systems are harder.

Centralized counter (Redis)

Most common approach. Atomic operations on Redis (INCR, EXPIRE) maintain accurate counters.

Pros: Accurate, simple to reason about. Cons: Redis becomes critical dependency; per-key hotspots possible at extreme rates.
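The INCR + EXPIRE pattern is short. This sketch runs against a tiny in-memory stub standing in for Redis (the stub and its explicit `now` clock are ours, purely so the logic is visible; with redis-py the equivalent calls are `r.incr(key)` and `r.expire(key, ttl)`):

```python
class FakeRedis:
    """In-memory stand-in for the two Redis commands the pattern needs."""

    def __init__(self):
        self.data = {}  # key -> (value, expires_at)

    def incr(self, key, now):
        value, expires_at = self.data.get(key, (0, None))
        if expires_at is not None and now >= expires_at:
            value, expires_at = 0, None  # key expired; start fresh
        value += 1
        self.data[key] = (value, expires_at)
        return value

    def expire(self, key, ttl, now):
        value, _ = self.data[key]
        self.data[key] = (value, now + ttl)

def allow_request(r, client_id, limit, window_s, now):
    # One counter per client per window; Redis expires it automatically.
    key = f"rl:{client_id}:{int(now // window_s)}"
    count = r.incr(key, now)
    if count == 1:
        r.expire(key, window_s * 2, now)  # set TTL on first increment
    return count <= limit
```

One real-world detail the sketch glosses over: INCR and EXPIRE should be atomic (a pipeline or a small Lua script), or a crash between them can leave a counter with no TTL.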

Local counters with periodic sync

Each server maintains local counters; syncs to central store every few seconds.

Pros: Low latency per request. Cons: Approximate; possible to exceed limits during sync windows.

Probabilistic algorithms

Allow each server to permit requests probabilistically based on its share of total capacity.

Pros: No central coordination; scales horizontally without a shared bottleneck. Cons: Approximate; harder to reason about.

Distributed token bucket

Token state replicated across multiple Redis nodes via consensus. Used in some cloud providers.

Pros: Accurate, distributed. Cons: Operationally complex.

For most companies, a Redis-based centralized counter is the right choice. Move to more sophisticated approaches only when scale actually demands it.

Communicating limits to clients

HTTP headers (de facto standard; Retry-After is formally standardized)

  • X-RateLimit-Limit: total limit
  • X-RateLimit-Remaining: remaining in current window
  • X-RateLimit-Reset: timestamp when window resets
  • Retry-After: seconds until next allowed request (on 429 response)

Response codes

  • 429 Too Many Requests: standard rate-limit response
  • 503 Service Unavailable: less common, sometimes used for system-wide overload
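Putting the headers and status codes together, a response might be assembled like this (function names and the tuple return shape are ours, for illustration; real code would set these on whatever response object your framework uses):

```python
def rate_limit_headers(limit, remaining, reset_epoch):
    """Headers attached to every response, limited or not."""
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(int(reset_epoch)),
    }

def respond(allowed, limit, remaining, reset_epoch, now):
    headers = rate_limit_headers(limit, remaining, reset_epoch)
    if allowed:
        return 200, headers
    # On rejection, tell the client exactly how long to wait.
    headers["Retry-After"] = str(max(1, int(reset_epoch - now)))
    return 429, headers
```

Sending the headers on every response, not just on 429s, is what lets well-behaved clients slow down before they hit the limit.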

Documentation

Clearly document limits per endpoint. Provide example client code that respects limits.

Client behavior

Well-behaved clients should:

  • Respect Retry-After headers
  • Implement exponential backoff on 429
  • Cache responses to reduce request rate
  • Batch operations where possible
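The first two items above can be sketched as one retry helper; `send` and the injectable `sleep` are hypothetical hooks so the logic is testable without real HTTP:

```python
import random
import time

def request_with_backoff(send, max_retries=5, base_delay=1.0, sleep=None):
    """Call send() until it succeeds, honoring Retry-After on 429s and
    falling back to jittered exponential backoff when the header is absent."""
    sleep = sleep or time.sleep
    for attempt in range(max_retries + 1):
        status, headers, body = send()
        if status != 429:
            return status, headers, body
        retry_after = headers.get("Retry-After")
        if retry_after is not None:
            delay = float(retry_after)  # server told us exactly how long
        else:
            # Exponential backoff with full jitter: up to 1s, 2s, 4s, ... capped.
            delay = random.uniform(0, min(60, base_delay * 2 ** attempt))
        sleep(delay)
    return status, headers, body
```

The jitter matters: without it, a fleet of clients that were rejected together retries together, recreating the same spike.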

Rate limiting at different layers

CDN / Edge

Coarse limits per IP. Stops DDoS before it reaches your infrastructure.

Common: AWS WAF, Cloudflare Rate Limiting, Fastly.

API Gateway

Per-API-key limits. Stops abusive integrations.

Common: AWS API Gateway, Kong, Tyk.

Application

Per-endpoint, per-user limits. Most flexible; closest to business logic.

Implement in middleware (Express, Spring Boot interceptors, Django middleware).

Database

Connection pool limits. Last line of defense against runaway queries.

Use connection pool max sizes; query timeouts.

Common mistakes

  • No rate limiting at all: surprise outages from accidental loads
  • Same limit for all endpoints: expensive endpoints (search, ML inference) need lower limits than cheap ones (status checks)
  • No headers communicated: clients can't behave well if they don't know limits
  • 429 without Retry-After: clients retry immediately, doubling load
  • Per-IP only: defeated by NAT, proxies, mobile networks
  • Hard limit only, no throttling: all-or-nothing rejection; even brief overages cause failures
  • No monitoring of rate limit hits: missing important signal about user behavior or attacks
  • Identical limits in dev/staging/prod: developers hit limits constantly during testing and report phantom "production issues"

Beyond rate limiting

Adjacent techniques:

  • Quota systems: monthly/annual usage limits for billing
  • Concurrent connection limits: max simultaneous connections per client
  • Request size limits: max request body size
  • Bandwidth limits: max bytes/second per connection
  • Adaptive limits: dynamically adjust based on system health

Production systems combine multiple defenses.

Rate limiting isn't optional for production APIs. The choice of algorithm, parameters, and implementation determines whether your API serves users predictably or fails embarrassingly under load. Start with sliding window counter on Redis with per-API-key limits, communicate via standard headers, and iterate based on real traffic patterns. The investment pays off in stability, cost control, and operational sanity.

Frequently asked questions

What's the difference between rate limiting and throttling?

Often used interchangeably but technically different. Rate limiting enforces a hard cap — once limit is reached, requests are rejected (with 429 status). Throttling slows down requests after a threshold — the request is still served but with added delay or queued. In practice, most production systems combine both: throttle moderately abusive clients, hard-limit egregious ones. APIs use rate limiting; some streaming or processing systems prefer throttling.

Where should rate limiting live in my stack?

Multiple layers, ideally. (1) Edge/CDN — coarse limits per IP, blocks DDoS. (2) API gateway — per-API-key limits. (3) Application — per-endpoint or per-user limits. (4) Database — connection-level limits. Each layer protects different resources. Don't rely solely on application-level limits — by then, the request has already consumed connection, parsing, and routing resources.

How do I rate limit across distributed servers?

Centralized counter (typically Redis with INCR + EXPIRE) is the most common approach. Each request increments a counter for the key+window; if count exceeds limit, reject. Pros: simple, accurate. Cons: Redis becomes a critical dependency; high request rates create Redis hotspots. Alternatives: probabilistic algorithms (allow approximate limits with no central counter), local counters with periodic synchronization, or distributed token buckets (rare but precise).
