
Tech · 7 min read

Distributed Locks: Redlock, Zookeeper, and Why They're Harder Than They Look

When you need a distributed lock, what your real options are (Redis, Zookeeper, Etcd), and the failure modes that make 'just use Redlock' a worse answer than it sounds.

By Jarviix Engineering · Apr 19, 2026


Distributed locks are one of those pieces of infrastructure that look simple on the whiteboard and turn into operational landmines in production. "Just use Redis SETNX" or "we use Redlock" sounds reasonable until the day a network blip causes two workers to both believe they hold the lock — and your system writes the same row twice.

This post covers what distributed locks are, why they're harder than they look, and the patterns that actually work.

Why locks are hard in distributed systems

In a single process, a mutex is a memory operation and the kernel guarantees mutual exclusion. The lock is either held or it isn't; there's no ambiguity.

Across machines, none of those guarantees survive:

  • Clocks drift. Two machines disagree about what time it is. "Lock expires at T+30s" means different things to different parties.
  • Networks delay. A worker thinks it still holds the lock; the lock service thinks it expired and gave it to someone else.
  • Processes pause. Garbage collection, swap, hypervisor scheduling, kubelet eviction — your process can be frozen for seconds without knowing.
  • Acks lie. A "lock acquired" response can be in flight while the network partitions.

A distributed lock is fundamentally an attempt to provide the illusion of mutual exclusion despite all of these. It's a leaky abstraction; understanding where it leaks is the difference between "works most of the time" and "correct".
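To make the process-pause failure concrete, here's a toy in-memory sketch (ToyLockService is an invented name for illustration) of a TTL-based lock service and a worker that stalls past its lease:

```python
class ToyLockService:
    """In-memory stand-in for a remote lock service with TTL-based expiry."""
    def __init__(self):
        self.now = 0          # simulated clock, in seconds
        self.holder = None
        self.expires_at = 0

    def acquire(self, owner, ttl):
        # Grant the lock if it is free or the previous lease has expired
        if self.holder is None or self.now >= self.expires_at:
            self.holder, self.expires_at = owner, self.now + ttl
            return True
        return False

svc = ToyLockService()
assert svc.acquire("worker-A", ttl=30)   # A is granted a 30s lease

svc.now += 35                            # A stalls (GC pause, swap, eviction)
assert svc.acquire("worker-B", ttl=30)   # service expired A's lease, grants B

# A wakes up having observed nothing; from its point of view it still
# holds the lock, while the service now says B does.
assert svc.holder == "worker-B"
```

The point is that A never receives any signal that its lease expired — the divergence is invisible to the stale holder.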

The standard options

Redis-based (SETNX, Redlock)

Simplest setup. Use SET key value NX PX 30000 — set the key if it doesn't exist, with a 30-second expiry.

def acquire_lock(key, owner, ttl_ms):
    # `owner` should be a unique random token per acquirer, so that
    # release_lock below can verify we still own the key before deleting it
    return redis.set(key, owner, nx=True, px=ttl_ms)

def release_lock(key, owner):
    # Lua script to ensure we only release a lock we still own
    script = """
    if redis.call('get', KEYS[1]) == ARGV[1] then
        return redis.call('del', KEYS[1])
    else
        return 0
    end
    """
    redis.eval(script, 1, key, owner)
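In practice you rarely try once and give up; a small jittered-backoff retry loop around the non-blocking acquire is typical. A sketch (acquire_with_retry is an invented helper, not a redis-py API; pass it any zero-argument try function):

```python
import random
import time

def acquire_with_retry(try_acquire, attempts=5, base_delay=0.05):
    """Call a non-blocking try_acquire() until it succeeds or attempts
    run out, sleeping with jittered exponential backoff between tries."""
    for i in range(attempts):
        if try_acquire():
            return True
        # Jitter spreads out retries so contending workers don't stampede
        time.sleep(base_delay * (2 ** i) * random.uniform(0.5, 1.5))
    return False
```

Usage would look like `acquire_with_retry(lambda: acquire_lock("job:42", owner, 30_000))`, where acquire_lock is the function above.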

Pros. Trivially simple. Fast (single Redis op). Available — you probably already run Redis.

Cons. Correctness depends on:

  • Clocks being roughly synchronized between Redis and clients.
  • Processes not pausing past the TTL while believing they hold the lock.
  • Redis primary not failing right after granting the lock but before replicating to followers.

For best-effort coordination (deduplicating background jobs, soft rate limits), this is fine. For correctness-critical mutual exclusion, it's not — Martin Kleppmann's analysis of the Redlock algorithm covers the failure modes in detail.

Redlock (multiple Redis instances)

Antirez's improvement: acquire the lock on a majority of N independent Redis instances. Quorum protects against a single Redis failing.

def acquire(key, owner, ttl_ms, redis_nodes):
    start = time.monotonic_ns()
    acquired = []
    for node in redis_nodes:
        try:
            if node.set(key, owner, nx=True, px=ttl_ms):
                acquired.append(node)
        except redis.RedisError:
            pass  # an unreachable node just counts as a failed acquisition
    elapsed_ms = (time.monotonic_ns() - start) / 1e6
    quorum = len(redis_nodes) // 2 + 1
    if len(acquired) >= quorum and elapsed_ms < ttl_ms:
        return True
    # Failed to reach quorum in time: release whatever we did acquire
    for node in acquired:
        node.delete(key)  # redis-py's method is delete(); `del` is a Python keyword
    return False

Stronger than single-node Redis, but the Kleppmann critique still applies: under specific clock-drift + process-pause scenarios, Redlock can grant a lock to two clients simultaneously. For workloads where "two workers briefly both think they have the lock" is acceptable, fine. For workloads where it isn't, use a system designed for consensus.

ZooKeeper

The classic correctness-focused choice. ZooKeeper provides linearizable writes and ephemeral nodes that disappear when the holder's session ends.

  1. The client creates an ephemeral sequential node /lock/lock-N.
  2. The client lists /lock/*; if its node has the smallest sequence number, it holds the lock.
  3. Otherwise, it watches the node immediately preceding it; when that node disappears, it rechecks.
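The decision step of this recipe — smallest sequence number wins, and everyone else watches only its immediate predecessor, which avoids a thundering herd on release — can be sketched in plain Python (lock_state is an invented helper for illustration, not a ZooKeeper client API):

```python
def lock_state(my_node, all_nodes):
    """Given this client's ephemeral sequential node name and a listing of
    /lock/*, return (holds_lock, node_to_watch) per the lock recipe."""
    # Sequential nodes are named like lock-0000000007; sort by the number
    ordered = sorted(all_nodes, key=lambda n: int(n.rsplit("-", 1)[1]))
    idx = ordered.index(my_node)
    if idx == 0:
        return True, None            # lowest sequence number: lock is ours
    return False, ordered[idx - 1]   # watch only the immediate predecessor

# lock-0002 sees lock-0001 ahead of it, so it watches that node and waits
assert lock_state("lock-0002", ["lock-0003", "lock-0001", "lock-0002"]) == (False, "lock-0001")
assert lock_state("lock-0001", ["lock-0003", "lock-0001", "lock-0002"]) == (True, None)
```

In real deployments you'd use a client library's ready-made recipe (kazoo for Python, Curator for Java) rather than hand-rolling this.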

Pros. Genuinely correct mutual exclusion. Sessions auto-expire if the client crashes. Battle-tested (Hadoop, Kafka, HBase use it).

Cons. Heavy to operate. Java-centric (though clients exist in many languages). Slower than Redis. Often overkill for simple use cases.

Etcd

The modern kid. Built on Raft consensus, similar guarantees to ZooKeeper, gentler operational model, used heavily in the Kubernetes world.

# Sketch using the python-etcd3 client
lease = etcd.lease(ttl=30)
key = "/locks/my-lock"
success, _ = etcd.transaction(
    compare=[etcd.transactions.create(key) == 0],  # key must not exist yet
    success=[etcd.transactions.put(key, "owner", lease)],
    failure=[],
)

Pros. Strong consistency. Lease-based timeouts auto-cleanup crashed holders. Good Go and HTTP APIs.

Cons. Operational dependency on etcd cluster (you probably have one if you run Kubernetes). Same speed and complexity ballpark as ZooKeeper.

Database-based (advisory locks, row locks)

If your services share a Postgres, you have a perfectly good distributed lock primitive: advisory locks.

-- Acquire (waits if held)
SELECT pg_advisory_lock(42);

-- Try to acquire (returns immediately)
SELECT pg_try_advisory_lock(42);

-- Release
SELECT pg_advisory_unlock(42);

Or row locks via SELECT FOR UPDATE.

Pros. Same correctness guarantees as your database. No extra infrastructure. Crash recovery handled by the database.

Cons. Not free — every locked operation holds a database connection, and it doesn't scale under high lock contention.
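A sketch of wrapping pg_try_advisory_lock in a context manager, assuming any DB-API connection with psycopg2-style placeholders (advisory_lock is an invented helper, not a library API):

```python
from contextlib import contextmanager

@contextmanager
def advisory_lock(conn, lock_id):
    """Try to take a Postgres advisory lock for the duration of the block.
    Yields True if acquired, False if another session holds it."""
    cur = conn.cursor()
    cur.execute("SELECT pg_try_advisory_lock(%s)", (lock_id,))
    (got,) = cur.fetchone()
    try:
        yield got
    finally:
        if got:
            # Session-level advisory locks must be released explicitly
            # (or they persist until the connection closes)
            cur.execute("SELECT pg_advisory_unlock(%s)", (lock_id,))
        cur.close()
```

Callers then write `with advisory_lock(conn, 42) as got:` and skip the work when `got` is False.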

The fencing token pattern

The most important defense against the "two workers both think they hold the lock" failure mode: never trust a lock blindly; require a monotonically increasing fencing token.

1. Acquire lock → receive fencing token N (monotonically increasing).
2. Send N along with every write to the protected resource.
3. The resource rejects writes with a token < the highest it has seen.

If two workers both believe they hold the lock, only the one with the higher token can write. The "stale" worker tries to write with token 41, but the resource has already accepted a write with token 42 and rejects it.
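The resource-side check is only a few lines; here's a toy in-memory version (FencedStore is an invented name for illustration):

```python
class FencedStore:
    """A protected resource that rejects writes carrying a stale fencing token."""
    def __init__(self):
        self.highest_token = -1
        self.value = None

    def write(self, token, value):
        if token < self.highest_token:   # stale holder: reject the write
            return False
        self.highest_token, self.value = token, value
        return True

store = FencedStore()
assert store.write(41, "from old holder")     # accepted, highest token now 41
assert store.write(42, "from new holder")     # newer token, accepted
assert not store.write(41, "stale retry")     # token 41 < 42: rejected
assert store.value == "from new holder"
```

The store needs no knowledge of the lock service at all — it only compares integers, which is what makes the pattern robust.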

This pattern requires the protected resource to support fencing tokens. ZooKeeper and Etcd provide them naturally (their consensus logs expose monotonically increasing positions: ZooKeeper's zxid and Etcd's revision). Redis Redlock does not — which is the root of much of the criticism.

Patterns that avoid distributed locks

Often the right move isn't to use a better lock — it's to design the system so you don't need one.

Single-writer assignments

Assign each piece of work to a single owner deterministically (by hash, by partition, by tenant). If only one worker can ever process a given key, no lock needed.

This is how Kafka consumer groups work: each partition is owned by exactly one consumer in a group. No lock primitives, no contention.
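A minimal sketch of deterministic ownership by hash (owner_for is an invented helper; note that plain modulo reshuffles most keys whenever the worker list changes — real systems use consistent hashing or a rebalancing protocol for that):

```python
import hashlib

def owner_for(key, workers):
    """Deterministically map a work key to exactly one worker. Every process
    computes the same answer, so no coordination or lock is needed."""
    digest = hashlib.sha256(key.encode()).digest()
    idx = int.from_bytes(digest[:8], "big") % len(workers)
    return workers[idx]

workers = ["worker-0", "worker-1", "worker-2"]
# Any process, on any machine, computes the same owner for a given key:
assert owner_for("tenant-42", workers) == owner_for("tenant-42", workers)
assert all(owner_for(k, workers) in workers for k in ["a", "b", "c"])
```

A worker then simply skips any key it doesn't own, and mutual exclusion falls out of the assignment.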

Optimistic concurrency

Read state with a version number. Write with a "where version = X" predicate. If no rows update, retry.

UPDATE accounts SET balance = $new, version = version + 1
WHERE id = $id AND version = $expected_version;

Fast in low-conflict scenarios; degenerates to retries under contention. Most online workloads are low-conflict.
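The application-side loop around that query is read, attempt a version-guarded write, retry on conflict. A toy sketch with an in-memory stand-in for the versioned UPDATE (VersionedStore and update_with_occ are invented names for illustration):

```python
class VersionedStore:
    """In-memory stand-in for `UPDATE ... WHERE version = $expected`."""
    def __init__(self):
        self.data = {}  # key -> (value, version)

    def read(self, key):
        return self.data.get(key, (0, 0))

    def cas(self, key, new_value, expected_version):
        _, version = self.data.get(key, (0, 0))
        if version != expected_version:
            return False            # conflict: someone wrote in between
        self.data[key] = (new_value, version + 1)
        return True

def update_with_occ(store, key, delta, max_retries=5):
    """Optimistic-concurrency loop: read (value, version), try a
    compare-and-swap keyed on the version, retry on conflict."""
    for _ in range(max_retries):
        value, version = store.read(key)
        if store.cas(key, value + delta, expected_version=version):
            return True
    return False

store = VersionedStore()
assert update_with_occ(store, "acct", 100)
assert store.read("acct") == (100, 1)
```

No lock is ever held; a lost race simply costs one extra round trip.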

Idempotency

Make the operation safe to do twice. Then concurrent execution is annoying, not catastrophic. (See idempotency in APIs.)
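A toy sketch of idempotency-key deduplication (process_once is an invented helper; `seen` stands in for a durable store keyed by request ID):

```python
def process_once(seen, idempotency_key, operation):
    """Run operation() at most once per idempotency key; duplicates
    return the cached result instead of re-executing."""
    if idempotency_key in seen:
        return seen[idempotency_key]
    result = operation()
    seen[idempotency_key] = result
    return result

seen = {}
calls = []
charge = lambda: calls.append(1) or "charged"

assert process_once(seen, "req-123", charge) == "charged"
assert process_once(seen, "req-123", charge) == "charged"  # duplicate: cached
assert len(calls) == 1                                     # ran exactly once
```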

Outbox pattern

For "exactly once" guarantees on emitting events, use the outbox pattern instead of a distributed lock between the database write and the event publication. (See event-driven architecture.)

When you genuinely need a lock

After exhausting the alternatives:

  • Leader election. Exactly one instance must be active at a time (a scheduler, a coordinator). Use ZooKeeper or Etcd ephemeral nodes.
  • Singleton background jobs. "Run this cron exactly once across the cluster." Redis SETNX is fine for low-stakes cases; Etcd for high-stakes.
  • Critical sections that can't be sharded. Some legacy systems require true mutual exclusion. Use a system designed for consensus.

Three rules

  1. Match the lock to the consequences. A lock for "skip this duplicate background job" can be Redis SETNX. A lock for "don't double-charge this customer" must be consensus-backed and use fencing tokens.
  2. Always set a TTL. Locks without expiry deadlock the system when the holder dies. Always have a timeout, always plan for what happens at expiry.
  3. Prefer designs that don't need locks at all. Single-writer partitions, idempotency, optimistic concurrency are all simpler and more scalable than distributed locks.

Distributed locks are usually a hint that something else in the design needs revisiting. Idempotency in APIs and event-driven architecture cover the patterns that most often replace the need for them. CAP theorem in practice explains the consistency model these locks live inside. The Uber HLD writeup is a useful applied read — ride dispatch is exactly the kind of "exactly one driver gets this trip" problem where you confront these choices for real.

Frequently asked questions

Can I just use a Redis SETNX as a lock?

For best-effort coordination on non-critical work, yes. For correctness-critical mutual exclusion (money, allocations), no — Redis-based locks have known correctness issues under specific failure modes.

What about ZooKeeper or Etcd?

Both are designed for coordination and are correct under the assumptions Redis-based locks struggle with. They're heavier to operate but they're the right answer for serious mutual exclusion.

Is there a way to avoid distributed locks entirely?

Often yes — and it's usually the right move. Single-writer assignments, sharding by key, optimistic concurrency, and idempotency all sidestep the need for a lock. Reach for locks only when you've ruled these out.
