Correctness in concurrent systems comes down to one question: how sure can we be of the state we are observing? For shared resources we may want to prevent concurrent reads and writes; only one client should hold a lock at a time to avoid data races.

Locks as a concurrency primitive in programming languages are well understood and safe to use when applied correctly. Locks in distributed systems are much harder to get right.

Redlock

Redlock is an algorithm for implementing a distributed lock manager on top of Redis. Implementations exist in a variety of languages, which makes it a popular choice. There is contention about its correctness, and reviewing that debate also gives insight into the complexity of distributed algorithms.

Acquiring a lock with the Redlock algorithm requires five steps: the client sets the lock in each instance of a Redis cluster and uses timestamps to establish a lock validity time. The validity time allows the lock to be released automatically, for example when an application crashes. Further, if a client fails to acquire the lock, it attempts to unlock all instances.

Redlock is an elaborate algorithm, and one thing it is not is simple. What I find concerning is the amount of I/O it requires. I assume the complexity is necessary, but I would always strive for simplicity, whether from an algorithmic perspective, in implementation cost, or by avoiding additional dependencies.

Advisory Locks

I found myself in need of a locking mechanism in a service-oriented environment that had both a Redis cluster and a Postgres database available. To avoid interference between the test data used in integration tests, I wanted to ensure that only one test at any given time can write to the data stores.

My problem with Redlock is the timeout used to establish lock validity. You never know what might stall your tests in a continuous integration service but you also want to avoid making the timeout too long.

Revisiting PostgreSQL’s Advisory Locks I was pleasantly surprised. The following Go code was sufficient to achieve what I needed:

// Block until we hold the exclusive advisory lock with key 0.
if _, err := db.ExecContext(ctx, "SELECT pg_advisory_lock(0)"); err != nil {
  t.Fatal(err)
}
// Release the lock when the test finishes, pass or fail.
t.Cleanup(func() {
  if _, err := db.ExecContext(ctx, "SELECT pg_advisory_unlock(0)"); err != nil {
    t.Fatal(err)
  }
})

Unlike table-level, row-level, or page-level locks, the semantics of advisory locks are defined by the application. Lock ownership is tied to the session, and the owning session can acquire the same lock multiple times (nested locking). Any other session will block until the lock is released. Not only does this avoid additional dependencies, it also eliminates busy waiting to acquire the lock.
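Since the semantics live in the application, so does the meaning of the lock key: pg_advisory_lock takes a 64-bit integer, and a common pattern is to derive that key from a descriptive resource name. A minimal sketch using FNV-1a; `lockKeyFor` is my own hypothetical helper, not part of any library:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// lockKeyFor derives a 64-bit advisory lock key from an
// application-level resource name, so the code can name what it
// is locking instead of scattering magic numbers like 0.
func lockKeyFor(resource string) int64 {
	h := fnv.New64a()
	h.Write([]byte(resource))
	return int64(h.Sum64())
}

func main() {
	// The same name always maps to the same key, so every test
	// process contends on the same lock:
	//   db.ExecContext(ctx, "SELECT pg_advisory_lock($1)", lockKeyFor("integration-test-data"))
	fmt.Println(lockKeyFor("integration-test-data") == lockKeyFor("integration-test-data")) // true
}
```

Hashing can in principle collide, but for a small, fixed set of named locks in one application that risk is usually acceptable.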

# EXPLAIN ANALYZE SELECT pg_advisory_lock(0);

                                     QUERY PLAN
------------------------------------------------------------------------------------
 Result  (cost=0.00..0.01 rows=1 width=4) (actual time=0.007..0.008 rows=1 loops=1)
 Planning Time: 0.044 ms
 Execution Time: 0.025 ms

Advisory locks are a lot faster than storing a flag in a table for the same purpose. Once the session terminates, the lock is released, which solves the problem of eventually releasing a lock acquired by an application that crashed. This makes advisory locks a great alternative in cases where the lock does not need to be available on several different machines.

Considerations

That said, depending on the circumstances, advisory locks are not always a viable replacement for Redlock. PostgreSQL may not be available in your environment. Session-based locking requires ensuring that the same connection is used for locking and unlocking, which is not guaranteed when connection pooling is involved. Automatic lock release works well if the application crashes, but there can be other reasons an application never releases the lock, resulting in deadlock or liveness issues.

Which exact properties are needed depends on why you need a locking mechanism in the first place; understand your requirements before committing to any solution. Distributed systems are not difficult because the domain has problems we have not yet found perfect solutions to, but because the context and application of a distributed system determine which problems need to be solved and which trade-offs are acceptable.