Tech · 6 min read

Load Balancing Strategies: L4 vs L7, Round Robin, and What 'Sticky Sessions' Really Cost

How load balancers actually distribute traffic — L4 vs L7, the algorithms that matter (round robin, least connections, consistent hashing), and the hidden costs of sticky sessions.

By Jarviix Engineering · Apr 19, 2026


Load balancers are the part of the infrastructure stack that everyone uses and few people understand. They look simple — distribute requests across servers — but the distribution algorithm has surprisingly large effects on tail latency, blast radius, and how your system fails under load.

This post walks through the two layers (L4 and L7), the algorithms that actually matter, and the trade-offs that decide which configuration fits your system.

L4 vs L7: what's the difference?

The difference comes down to the OSI layer the load balancer operates at:

L4 (transport layer)

The load balancer sees TCP packets. It looks at the destination IP and port; it does not see HTTP headers or paths.

  • Pros. Extremely fast — minimal per-packet processing. Works for any TCP/UDP protocol (databases, custom protocols, gRPC streaming). Connections pass through to the backend unmodified.
  • Cons. Can't make routing decisions based on URL, host header, or other content.

Used heavily for: database load balancing, internal service-to-service traffic at large scale, raw network appliances.

L7 (application layer)

The load balancer terminates the connection, parses HTTP, and routes based on URL, host, headers, cookies.

  • Pros. Path-based routing (/api/* → backend A, /static/* → backend B). TLS termination. Header rewriting. Health checks tied to actual application behavior.
  • Cons. More CPU per request. Single connection from client gets re-multiplexed across multiple backend connections.

Used heavily for: public-facing web traffic, API gateways, ingress controllers.

Most production setups have both: an L4 load balancer in front (for raw throughput and DDoS protection), and L7 inside (for application-level routing).

The algorithms that matter

Round robin

Send each new request to the next backend in sequence.

req 1 → backend A
req 2 → backend B
req 3 → backend C
req 4 → backend A
...

Pros. Trivially simple. No state required.

Cons. Assumes all backends are equally fast and all requests cost the same. Both are usually false.

Reach for it when you have homogeneous backends and roughly uniform requests.
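The selection logic is tiny. Here's a minimal Python sketch (the backend names are illustrative; a real load balancer would hold host:port addresses):

```python
from itertools import cycle

def round_robin(backends):
    """Yield backends in a fixed repeating sequence."""
    return cycle(backends)

rr = round_robin(["backend-a", "backend-b", "backend-c"])
print([next(rr) for _ in range(4)])
# → ['backend-a', 'backend-b', 'backend-c', 'backend-a']
```

Note that `cycle` is per-process state: multiple LB workers each running their own counter will not produce a globally even sequence, though the aggregate distribution still evens out.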

Weighted round robin

Round robin, but each backend gets a weight.

A: weight 5, B: weight 1
→ A, A, A, A, A, B, A, A, A, A, A, B, ...

Useful when backends have different capacities (canary at 10%, large instances at higher weight).
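A naive sketch of the idea — each backend simply appears `weight` times per cycle. (Production implementations like NGINX use a "smooth" variant that interleaves more evenly, but the proportions are the same.)

```python
def weighted_round_robin(weights):
    """Yield backends in proportion to their integer weights (naive expansion)."""
    schedule = [b for b, w in weights.items() for _ in range(w)]
    while True:
        yield from schedule

gen = weighted_round_robin({"A": 5, "B": 1})
print("".join(next(gen) for _ in range(12)))
# → AAAAABAAAAAB
```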

Least connections

Send each new request to the backend with the fewest in-flight connections.

A: 12 connections, B: 7 connections, C: 9 connections
→ next request goes to B

Pros. Adapts to varying request durations. A backend stuck on a slow request gets fewer new ones.

Cons. Requires per-backend connection tracking. Slightly more complex.

This is often the safest default for L7 ingress balancing — it handles uneven request durations gracefully.
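The per-request decision reduces to a `min` over tracked counts — a sketch, assuming the LB maintains an in-flight counter per backend:

```python
def pick_least_connections(in_flight):
    """Choose the backend with the fewest in-flight connections.

    `in_flight` maps backend name → current open-request count,
    incremented when a request is dispatched and decremented on completion.
    """
    return min(in_flight, key=in_flight.get)

print(pick_least_connections({"A": 12, "B": 7, "C": 9}))
# → B
```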

Least response time

Like least connections, but uses recent response time as a signal.

A: 80ms p50, B: 220ms p50, C: 100ms p50
→ next requests prefer A

Adapts faster to backend slowness. Useful when one backend is degraded but not yet failing health checks.
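One common way to keep the signal fresh is an exponentially weighted moving average of observed latencies. A sketch (the `alpha` smoothing factor and the zero-initialization, which biases toward backends with no data yet, are design choices, not a standard):

```python
class ResponseTimeTracker:
    """Track an EWMA of response time per backend; route to the fastest."""

    def __init__(self, backends, alpha=0.2):
        self.alpha = alpha
        self.ewma = {b: 0.0 for b in backends}

    def record(self, backend, latency_ms):
        """Fold a new observation into the backend's running average."""
        prev = self.ewma[backend]
        self.ewma[backend] = self.alpha * latency_ms + (1 - self.alpha) * prev

    def pick(self):
        """Choose the backend with the lowest smoothed latency."""
        return min(self.ewma, key=self.ewma.get)
```

In practice this is usually blended with connection counts (as in Envoy's "least request" with latency weighting) rather than used alone, since a fast-but-failing backend can report excellent latency.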

IP hash / consistent hashing

Hash the client IP (or session key) to pick a backend. Same client always lands on the same backend.

hash(client_ip) % N → backend index

Pros. Cache locality — the same user's data is hot on the same backend. Useful for in-memory caches per backend.

Cons. Adding/removing backends reshuffles a lot of clients (vanilla mod-hashing). Use consistent hashing to minimize reshuffling.
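A minimal hash ring with virtual nodes shows why: each backend owns many small arcs of the ring, and removing a backend only remaps the keys that landed on its arcs. (MD5 and 100 virtual nodes are illustrative choices.)

```python
import hashlib
from bisect import bisect

class ConsistentHashRing:
    """Map keys to backends on a hash ring so membership changes
    remap roughly 1/N of keys, instead of nearly all (as hash % N does)."""

    def __init__(self, backends, vnodes=100):
        # Each backend gets `vnodes` positions on the ring.
        self.ring = sorted(
            (self._hash(f"{b}#{i}"), b)
            for b in backends for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def pick(self, client_key):
        """Route to the first ring position clockwise of the key's hash."""
        idx = bisect(self.keys, self._hash(client_key)) % len(self.keys)
        return self.ring[idx][1]
```

The key property: a client whose backend survives a membership change keeps its backend, because the surviving virtual nodes don't move.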

Random

Yes, really. Send each request to a random backend.

Pros. No state, no coordination. Surprisingly competitive with round robin in practice.

Cons. Same blind spots as round robin.

The "Power of Two Choices" variant — pick two backends at random, send to the less loaded — is mathematically very good and used by many modern load balancers (NGINX, HAProxy, Envoy all support it).

Sticky sessions

The pattern: tie a user's requests to a specific backend (via cookie, IP hash, or other key).

The pitch. Your application state lives in memory on a particular server. Routing the same user back keeps that state warm.

The reality. Almost always a mistake in modern systems:

  • Failover is harder. Backend dies → user's "session" is gone unless backed up elsewhere.
  • Scaling is uneven. Some backends end up with 100 sticky users, others with 5.
  • Capacity planning is harder. You can't safely drain one backend without impacting users pinned to it.
  • Deploys are harder. "Drain connections, wait, restart" gets long when sessions are long-lived.

The right answer almost always. Make your servers stateless. Store session state in Redis or a database. Any backend can serve any request. Round robin or least connections distributes load evenly. Failures are invisible to users.

The exception: WebSockets and long-lived connections genuinely need stickiness because the connection itself is bound to a particular backend. For these, sticky-by-connection (not sticky-by-user) is the right pattern.

Health checks

Every load balancer needs to know which backends are alive. Two approaches:

Active health checks

The LB pings each backend periodically (typically /health or a TCP probe).

Every 5s: GET /health
2 consecutive failures → mark unhealthy, stop routing
3 consecutive successes → mark healthy, resume routing

Most production systems use this. Tune:

  • Interval: trade detection speed vs probe overhead. 5-10s is typical.
  • Failure threshold: trade speed vs flakiness sensitivity.
  • Probe depth: "is the process alive" vs "can it actually serve requests" (the latter catches more real problems but is more expensive).
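The mark-unhealthy/mark-healthy logic above is a small state machine. A sketch with the thresholds from the example (2 consecutive failures down, 3 consecutive successes up — both tunable):

```python
class HealthTracker:
    """Consecutive-failure / consecutive-success health state machine."""

    def __init__(self, fail_threshold=2, success_threshold=3):
        self.fail_threshold = fail_threshold
        self.success_threshold = success_threshold
        self.healthy = True
        self.fails = 0
        self.successes = 0

    def record(self, probe_ok):
        """Fold in one probe result; a success resets the failure streak
        and vice versa, so only consecutive results count."""
        if probe_ok:
            self.fails = 0
            self.successes += 1
            if not self.healthy and self.successes >= self.success_threshold:
                self.healthy = True
        else:
            self.successes = 0
            self.fails += 1
            if self.healthy and self.fails >= self.fail_threshold:
                self.healthy = False
```

Requiring consecutive results in both directions is what keeps a flaky backend from flapping in and out of rotation on every probe.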

Passive health checks

Watch real traffic. If a backend returns 5xx repeatedly, mark unhealthy.

Less overhead, slightly slower to react. Often combined with active checks.

Connection draining

When you take a backend out of service (deploy, scale down, retirement), you don't want to terminate it mid-request. Connection draining:

  1. LB stops sending new requests to the backend.
  2. In-flight requests are allowed to complete (within a deadline).
  3. After the deadline, the backend is forcibly removed.

Good drain timeouts depend on your typical request duration: 30s for typical web traffic, longer for long-running endpoints, much longer for WebSockets.
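The three steps above can be sketched as a loop. The `stop_routing` and `in_flight_count` hooks are hypothetical — stand-ins for whatever your LB exposes for deregistering a backend and reading its live request count:

```python
import time

def drain(stop_routing, in_flight_count, deadline_s=30.0, poll_s=0.5,
          clock=time.monotonic, sleep=time.sleep):
    """Stop new traffic, wait for in-flight requests, force-remove at deadline."""
    stop_routing()                    # 1. no new requests to this backend
    deadline = clock() + deadline_s
    while clock() < deadline:
        if in_flight_count() == 0:
            return "drained"          # 2. all in-flight requests completed
        sleep(poll_s)
    return "force-removed"            # 3. deadline hit with requests still open
```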

TLS termination

Where does HTTPS get decrypted?

  • At the LB. Cheaper for backends; backends speak plain HTTP. LB sees the request content (good for L7 routing).
  • End-to-end (passthrough). LB doesn't terminate; backends handle TLS themselves. More secure for sensitive traffic; requires backends to manage certs.
  • Re-encrypted. LB terminates client TLS, then makes a new TLS connection to backend. Sees content (for L7 routing) and encrypts the LB→backend hop.

For public traffic on most stacks: terminate at the LB, re-encrypt to backends if you're in a regulated environment, plain HTTP to backends otherwise.

Three rules

  1. Make backends stateless. Sticky sessions limit your scaling options forever. Pay the upfront cost of moving session state to Redis or a database.
  2. Pick algorithms based on request shape, not vibes. Uniform short requests: round robin or random. Variable durations: least connections. Cache locality matters: consistent hashing.
  3. Test failure modes. Most load balancer config is happy-path code. Break a backend, watch the metrics. Slow a backend (don't kill it) and see if your algorithm routes around it.

Load balancing fits into the broader picture of system design basics. How CDNs work covers the layer above (geographically distributed edge caching), and microservices observability is what you need to actually see what your load balancer is doing in production. The YouTube/Netflix HLD writeup walks through how multi-tier load balancing — global DNS, regional, then per-rack — actually composes in a video-scale system.

Frequently asked questions

Should I use sticky sessions?

Almost always no. They limit your scaling options and complicate failover. The right answer is to make your servers stateless and store sessions in Redis or a database.

L4 or L7 for a typical web app?

L7 (HTTP-aware) for the public ingress — you need path-based routing, header inspection, TLS termination. L4 if you're routing arbitrary TCP/UDP or care about absolute peak throughput.

What does 'least connections' actually do?

Sends each new request to the backend currently handling the fewest active connections. Better than round robin when request durations vary widely (some take 50ms, others take 5 seconds).
