Distributed Coordination · #59

Distributed Locking

Distributed locks are fundamentally unsafe without storage-level validation like fencing tokens.

Published May 29, 2026 · By MortalApps · 5 min read · 1,030 words

TL;DR

Distributed locks are fundamentally unsafe without storage-level validation like fencing tokens.
Garbage collection pauses and network partitions will cause lock lease expiry before execution completes.
Use optimistic concurrency control or database-level constraints instead of distributed locks whenever possible.
Redlock and similar multi-node consensus-less locking algorithms rely on dangerous assumptions about physical clock synchronization.

Problem Idea Flow Examples Anti-patterns Tradeoffs Best Practices Related

The Problem

In a distributed system, multiple worker nodes often need to coordinate access to a shared, non-distributed resource—such as an external API that cannot handle concurrent requests, or a legacy database without transaction isolation. If two workers attempt to modify this resource simultaneously, data corruption or inconsistent states occur.

Engineers frequently reach for a distributed lock to enforce mutual exclusion. However, in real-world networks, nodes experience unpredictable stop-the-world garbage collection (GC) pauses, hypervisor steals, and network partitions. A worker acquires a lock with a specific Time-To-Live (TTL), enters a GC pause that lasts longer than the TTL, and resumes execution. During this pause, the lock expires, a second worker acquires it, and both workers execute the critical section concurrently, leading to silent data corruption.

Core System Idea

To make distributed locking safe, the system must shift from relying on absolute time to using logical, monotonically increasing sequence numbers, commonly referred to as fencing tokens.

The core architecture pattern involves a centralized, strongly consistent coordination store (the lock manager) and a storage layer that acts as the final gatekeeper. When a client successfully acquires a lock, the lock manager returns not just a boolean success flag, but also a fencing token—a number that increments with every lock acquisition.

The storage layer must keep track of the highest fencing token it has processed. When a client attempts to write to the storage layer, it must present its fencing token. If a client with an expired lock attempts a write after a new client has acquired the lock (and thus received a higher token), the storage layer rejects the write because the incoming token is lower than the highest token already processed. This guarantees safety even when lock leases expire prematurely due to client-side pauses.

System Flow

The storage layer rejects Client 1's delayed write because its fencing token is lower than the already processed Token 22.

Real-World Examples Indicative

Google Chubby — sequencer tokens in Bigtable, zero stale writes across 10M+ RPS

Google Chubby is the distributed lock service underpinning Bigtable and GFS. When a Chubby client acquires a lock, Chubby returns a "sequencer" — a monotonically increasing 64-bit integer bundled with the lock handle. Bigtable tablet servers validate every incoming RPC by requiring the caller to include its sequencer number. If a GC-paused client resumes after its lock expired and a new holder has received a higher sequencer, the tablet server rejects the stale write with SEQUENCER_TOO_OLD. The original Chubby paper (Mike Burrows, 2006) explicitly states that Bigtable's correctness relies entirely on sequencer validation at the tablet server level — the lock TTL alone is insufficient because clients cannot be trusted to honor it under GC or hypervisor pause.

HBase ZooKeeper — ephemeral sequential znodes as fencing tokens for RegionServer assignment

HBase uses ZooKeeper for distributed locking on Region assignment. When the HMaster assigns a Region to a RegionServer, it creates an ephemeral sequential ZNode (e.g., /hbase/region-locks/region_abc-0000000042). The ZNode sequence number acts as the fencing token. If the RegionServer experiences a GC pause exceeding ZooKeeper's session timeout (default 90 seconds), ZooKeeper deletes the ZNode and the Region is reassigned — the new holder receives ZNode 0000000043. If the stale RegionServer resumes and attempts an HDFS write, the HMaster rejects it by comparing the incoming ZNode sequence against the current holder's ZNode. Without this ZNode sequence check, two concurrent RegionServers would corrupt overlapping HDFS blocks silently.

Martin Kleppmann's Redlock critique — 15s GC pause triggers dual lock ownership across 5 Redis nodes

In February 2016, Martin Kleppmann published "How to do distributed locking," describing a concrete Redlock failure: a client acquires locks on 3 of 5 Redis nodes (quorum) with a 10-second TTL, then enters a 15-second stop-the-world GC pause. During the pause, the TTLs expire on all 5 nodes. A second client acquires locks on all 5 nodes. The first client resumes, believes its lock is still valid (its application-level clock advanced only milliseconds), and begins writing — both clients now hold Redlock simultaneously. Kleppmann's critique: Redlock's safety depends on bounded process pauses and bounded clock drift, neither of which Redis can enforce. Redis author Salvatore Sanfilippo responded publicly, but the fundamental issue — that Redlock provides only probabilistic safety — was not resolved.

Anti-Patterns

Using Locks for Database Writes

Acquiring a distributed lock to update a SQL database instead of using native database constraints, optimistic concurrency control (OCC) via version columns, or SELECT ... FOR UPDATE transactions.

Ignoring Fencing Tokens

Acquiring a lock from Redis or ZooKeeper and executing a write to a third-party API or database without passing or validating a monotonic sequence number.

Short TTLs Without Auto-Renewal

Setting a very short lock TTL to minimize recovery time during worker crashes, without implementing an out-of-band background thread to actively renew (heartbeat) the lock while the worker is still processing.

Relying on Redlock for Hard Correctness

Deploying Redlock in environments where clock drift, NTP adjustments, or VM pauses can violate the assumption of synchronized physical time, leading to concurrent lock ownership.

Design Tradeoffs

Dimension	CP-Based Lock (ZooKeeper)	AP-Based Lock (Redis/Redlock)
Consistency guarantee	Strict mutual exclusion; quorum agreement prevents two clients from concurrently holding the lock even during network partitions	Probabilistic exclusion; assumes bounded clock drift and bounded process pause times — guarantees Redis cannot enforce
Operational complexity	High; requires a dedicated ZooKeeper or etcd ensemble with strict fsync and quorum configuration	Low; single or multi-node Redis with simple TTL-based locking logic; minimal operational overhead
Lock release on crash	Automatic; the ephemeral ZNode is deleted when the client session expires, immediately freeing the lock for the next holder	TTL-based; the lock persists until the TTL elapses — a crashed holder blocks new acquisitions for the full TTL duration

Best Practices

Enforce Storage-Level ValidationAlways design the target resource to validate incoming fencing tokens or version numbers before committing any state changes.

Implement Background Lock RenewalRun a dedicated background thread or coroutine to periodically extend the lock TTL while the main processing thread is actively working on the critical section.

Use Monotonic Clocks for TimeoutsEnsure all client-side timeout calculations use monotonic clocks rather than system wall-clock time to prevent NTP synchronization from skewing lock durations.

Fail Fast on Heartbeat LossIf the lock renewal thread fails to contact the lock manager within a safety threshold, immediately halt the main processing thread to prevent split-brain execution.

When to Use / Avoid

Use When	Avoid When
Coordinating access to external, third-party APIs that do not support transactions or versioning.	The target storage engine supports native transactions, optimistic concurrency control, or unique constraints.
Performing coarse-grained, long-running background tasks where only one worker should run at a time (e.g., nightly report generation).	High-throughput, low-latency transaction processing is required; locking introduces severe performance bottlenecks.
Implementing leader election for a stateless worker pool where occasional duplicate execution is costly but not catastrophic.	The system requires absolute, mathematically proven linearizability without any possibility of double-writes.