Distributed Locking
Distributed locks are fundamentally unsafe without storage-level validation like fencing tokens.
- Distributed locks are fundamentally unsafe without storage-level validation like fencing tokens.
- Garbage collection pauses and network partitions will cause lock lease expiry before execution completes.
- Use optimistic concurrency control or database-level constraints instead of distributed locks whenever possible.
- Redlock and similar multi-node consensus-less locking algorithms rely on dangerous assumptions about physical clock synchronization.
The Problem
In a distributed system, multiple worker nodes often need to coordinate access to a shared, non-distributed resource—such as an external API that cannot handle concurrent requests, or a legacy database without transaction isolation. If two workers attempt to modify this resource simultaneously, data corruption or inconsistent states occur.
Engineers frequently reach for a distributed lock to enforce mutual exclusion. However, in real-world networks, nodes experience unpredictable stop-the-world garbage collection (GC) pauses, hypervisor steals, and network partitions. A worker acquires a lock with a specific Time-To-Live (TTL), enters a GC pause that lasts longer than the TTL, and resumes execution. During this pause, the lock expires, a second worker acquires it, and both workers execute the critical section concurrently, leading to silent data corruption.
Core System Idea
To make distributed locking safe, the system must shift from relying on absolute time to using logical, monotonically increasing sequence numbers, commonly referred to as fencing tokens.
The core architecture pattern involves a centralized, strongly consistent coordination store (the lock manager) and a storage layer that acts as the final gatekeeper. When a client successfully acquires a lock, the lock manager returns not just a boolean success flag, but also a fencing token—a number that increments with every lock acquisition.
The storage layer must keep track of the highest fencing token it has processed. When a client attempts to write to the storage layer, it must present its fencing token. If a client with an expired lock attempts a write after a new client has acquired the lock (and thus received a higher token), the storage layer rejects the write because the incoming token is lower than the highest token already processed. This guarantees safety even when lock leases expire prematurely due to client-side pauses.
System Flow
The storage layer rejects Client 1's delayed write because its fencing token is lower than the already processed Token 22.
Real-World Examples Indicative
Google Chubby is the distributed lock service underpinning Bigtable and GFS. When a Chubby client acquires a lock, Chubby returns a "sequencer" — a monotonically increasing 64-bit integer bundled with the lock handle. Bigtable tablet servers validate every incoming RPC by requiring the caller to include its sequencer number. If a GC-paused client resumes after its lock expired and a new holder has received a higher sequencer, the tablet server rejects the stale write with SEQUENCER_TOO_OLD. The original Chubby paper (Mike Burrows, 2006) explicitly states that Bigtable's correctness relies entirely on sequencer validation at the tablet server level — the lock TTL alone is insufficient because clients cannot be trusted to honor it under GC or hypervisor pause.
HBase uses ZooKeeper for distributed locking on Region assignment. When the HMaster assigns a Region to a RegionServer, it creates an ephemeral sequential ZNode (e.g., /hbase/region-locks/region_abc-0000000042). The ZNode sequence number acts as the fencing token. If the RegionServer experiences a GC pause exceeding ZooKeeper's session timeout (default 90 seconds), ZooKeeper deletes the ZNode and the Region is reassigned — the new holder receives ZNode 0000000043. If the stale RegionServer resumes and attempts an HDFS write, the HMaster rejects it by comparing the incoming ZNode sequence against the current holder's ZNode. Without this ZNode sequence check, two concurrent RegionServers would corrupt overlapping HDFS blocks silently.
In February 2016, Martin Kleppmann published "How to do distributed locking," describing a concrete Redlock failure: a client acquires locks on 3 of 5 Redis nodes (quorum) with a 10-second TTL, then enters a 15-second stop-the-world GC pause. During the pause, the TTLs expire on all 5 nodes. A second client acquires locks on all 5 nodes. The first client resumes, believes its lock is still valid (its application-level clock advanced only milliseconds), and begins writing — both clients now hold Redlock simultaneously. Kleppmann's critique: Redlock's safety depends on bounded process pauses and bounded clock drift, neither of which Redis can enforce. Redis author Salvatore Sanfilippo responded publicly, but the fundamental issue — that Redlock provides only probabilistic safety — was not resolved.
Anti-Patterns
Acquiring a distributed lock to update a SQL database instead of using native database constraints, optimistic concurrency control (OCC) via version columns, or SELECT ... FOR UPDATE transactions.
Acquiring a lock from Redis or ZooKeeper and executing a write to a third-party API or database without passing or validating a monotonic sequence number.
Setting a very short lock TTL to minimize recovery time during worker crashes, without implementing an out-of-band background thread to actively renew (heartbeat) the lock while the worker is still processing.
Deploying Redlock in environments where clock drift, NTP adjustments, or VM pauses can violate the assumption of synchronized physical time, leading to concurrent lock ownership.
Design Tradeoffs
| Dimension | CP-Based Lock (ZooKeeper) | AP-Based Lock (Redis/Redlock) |
|---|---|---|
| Consistency guarantee | Strict mutual exclusion; quorum agreement prevents two clients from concurrently holding the lock even during network partitions | Probabilistic exclusion; assumes bounded clock drift and bounded process pause times — guarantees Redis cannot enforce |
| Operational complexity | High; requires a dedicated ZooKeeper or etcd ensemble with strict fsync and quorum configuration | Low; single or multi-node Redis with simple TTL-based locking logic; minimal operational overhead |
| Lock release on crash | Automatic; the ephemeral ZNode is deleted when the client session expires, immediately freeing the lock for the next holder | TTL-based; the lock persists until the TTL elapses — a crashed holder blocks new acquisitions for the full TTL duration |
Best Practices
When to Use / Avoid
| Use When | Avoid When |
|---|---|
| Coordinating access to external, third-party APIs that do not support transactions or versioning. | The target storage engine supports native transactions, optimistic concurrency control, or unique constraints. |
| Performing coarse-grained, long-running background tasks where only one worker should run at a time (e.g., nightly report generation). | High-throughput, low-latency transaction processing is required; locking introduces severe performance bottlenecks. |
| Implementing leader election for a stateless worker pool where occasional duplicate execution is costly but not catastrophic. | The system requires absolute, mathematically proven linearizability without any possibility of double-writes. |