Neil Lugovoy • 1 year ago

Hi Martin,

Thank you for your post.

The fencing algorithm would seem to allow Client 1 to write without holding the lease if it wrote with token 33 before Client 2 wrote with token 34 (I made a crude diagram below). I'm interested in enforcing that the client with the expired lease is not allowed to write to storage at all even if it writes before the client holding the lease.

1. Are there any algorithms to enforce this?
2. Are there any libraries or storage solutions that implement such an algorithm?

Thank you for your time.

https://uploads.disquscdn.c...

HC • 1 year ago

I think the lock service and the storage have to be combined into a single service to resolve this tricky race condition.
Another workaround is for the client to send the storage system the hash of the whole original file, as it was before the client's modifications. The storage system then compares that hash with the hash of the current file to determine whether the file has been modified between the client reading it and sending back a modified version.
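A minimal sketch of that hash check on the storage side (names invented; a real system would need the compare and the write to be atomic):

```python
import hashlib

class HashCheckedStorage:
    """Toy single-file store that rejects writes carrying a stale content hash."""

    def __init__(self, content: bytes = b""):
        self.content = content

    def write(self, new_content: bytes, original_hash: str) -> bool:
        current_hash = hashlib.sha256(self.content).hexdigest()
        if current_hash != original_hash:
            return False             # file changed since the client read it
        self.content = new_content
        return True

store = HashCheckedStorage(b"v1")
snapshot = hashlib.sha256(store.content).hexdigest()   # client remembers this hash
assert store.write(b"v2", snapshot)        # accepted
assert not store.write(b"v3", snapshot)    # rejected: the hash is now stale
```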

Kobe W • 1 month ago

The storage service is already validating the token with the lock service somehow, so additionally checking the TTL of the current lock on top of that shouldn't be too hard. The latency is already there.

Andrey Tamelo • 11 months ago

> Another workaround is the client sends back the hash of the whole original file before its modifications to the storage system.

Or simply a (row) version, as in an optimistic concurrency check.

Martin Kleppmann • 1 year ago

Hi Neil, good question. An obvious thing you could try is that the storage service can make a request to the lock service to check whether a token is still current, before accepting a write, although that will add latency. Other than that I don't really have a good answer. It might be a fundamental limitation because you have a race condition between the two clients, and in the absence of further information (such as a request to the lock service, or assumptions about clock synchronisation) the storage service cannot know whether a lease is still valid.
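To make the first suggestion concrete, here is a minimal sketch of a storage service that re-validates the token with the lock service before each write (the `is_token_current` call is a hypothetical lock-service API; and as discussed further down the thread, the check and the write are still not atomic):

```python
class FencedStorage:
    """Toy storage service that re-checks the fencing token with the lock service."""

    def __init__(self, lock_service):
        self.lock_service = lock_service   # hypothetical client for the lock service
        self.highest_token = 0
        self.value = None

    def write(self, token: int, value) -> bool:
        # Usual fencing rule: tokens must only ever increase.
        if token <= self.highest_token:
            return False
        # Extra round-trip to the lock service: adds latency, and the lease can
        # still expire between this check and the write below.
        if not self.lock_service.is_token_current(token):
            return False
        self.highest_token = token
        self.value = value
        return True
```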

Kobe W • 1 month ago

Actually, validating the access token before the storage accepts the request is still not 100% reliable. Consider the case where the lock happens to expire and is then acquired by another client just after the validation passes. Because of this timing issue, the storage can then take concurrent writes from two clients at the same time.

This is because "validate the token" and "take the write" are not atomic operations and do not happen instantaneously.

Sambit Bharimalla • 1 week ago

What if the lock service also responds with the lock's start time and expiration time to client 1? Client 1 in turn sends these to the storage system, and the storage system just validates whether the expiry time is in the future, assuming the absence of clock synchronisation issues.

Martin Kleppmann • 1 week ago

That only works if the lock service and the storage system have perfectly synchronised clocks. However, clock sync typically runs over NTP, which is subject to the same network delays as all other packets, and that delay causes the clocks to be slightly out of sync with each other. Even if the clock drift is on the order of milliseconds, that's not close enough for this to be safe.

Douglas Muth • 8 years ago

"Instead, please use a proper consensus system such as ZooKeeper, probably."

I literally skimmed the entire article for this part. :-)

That said, it's a good writeup, and very informative. Thanks for sharing!

Guest • 3 years ago

Hello.

But wouldn't this sort of fencing token scheme imply that the storage has a critical section where it can compare the current token with the last one it received from the clients? This critical section alone seems enough to avoid conflicts. Also, with this approach, the storage could simply store object versions, and reject operations that didn't specify the most recent version.

What I'm saying is, it sure looks like if it's possible to implement fenced updates in the storage at all, then you don't need a separate lock service anymore. What am I missing?

Martin Kleppmann • 3 years ago

Well observed. I didn't want to go into too much detail in this already long post. There are situations in which a fencing token is still useful. For example, if there are several replicas of the storage service, then each replica can independently check the freshness of the fencing token without coordinating with the other replicas. If a quorum (e.g. 2 out of 3) of replicas accept a fencing token as valid, then a write is reported as successful to the client.

This situation is actually very similar to the second phase of Paxos or Raft (if you replace "fencing token" with "term number"), where a leader tries to replicate a value to a quorum of nodes. The first phase of Paxos/Raft is the leader election, which corresponds closely to acquiring a distributed lock. Consensus algorithms ensure that there is at most one leader for a given term number, and that term numbers monotonically increase over time. Being the leader in a particular term number corresponds to acquiring the lock with a particular fencing token.

Valentin Kovalenko • 3 years ago

Thanks, Martin! Makes sense, but it also means that while trying to come up with a distributed lock managing algorithm we ended up with nothing but another consensus algorithm :)

Valentin Kovalenko • 3 years ago

Yes, this is exactly what I expressed in the comments here about 2 years ago. Well, I am glad I am not the only one who sees it this way :)

Arun Chouhan • 4 years ago

After reading the article, one conclusion can be drawn: it is specific to Redlock, which is implemented in the Redisson library. But I would like to implement a lock from scratch using incrementing atomic counters in Redis, without setting a timeout.

The implementation would look like this (sketched below):
getLock(key): increment the key by 1; if the value I get back is 1, the lock is acquired, otherwise someone has already taken the lock.

releaseLock(key): delete the key.

Is this implementation safe enough for correctness?
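Roughly, with redis-py, the sketch would be (illustrative only; as the reply below points out, without a timeout a crashed lock holder blocks everyone forever):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

def get_lock(key: str) -> bool:
    # INCR is atomic: only the caller that takes the counter from 0 to 1 wins.
    return r.incr(key) == 1

def release_lock(key: str) -> None:
    r.delete(key)

if get_lock("lock:resource42"):
    try:
        ...  # do the protected work
    finally:
        release_lock("lock:resource42")
```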

Valentin Kovalenko • 4 years ago

Not having a lock timeout means that any process that grabs the lock and becomes unavailable makes the whole LM (lock manager) unusable. So if you do not have a timeout, then there is no sense in a DLM (distributed LM) at all (though there is no sense in a DLM anyway...). Thus, if you simply need an LM, then using PostgreSQL advisory session locks makes more sense than doing it with Redis, because PostgreSQL session locks are at least auto-unlocked if the holding session disconnects, so an unavailable process holding the lock will not render a PostgreSQL-based LM unusable.

climbsocial • 6 years ago

Hi Martin, now that Kafka 0.11 has support for transactions, could we use it in some way, shape, or form for distributed locks? Perhaps using an event-sourced pattern where an aggregate tracks who is locking it?

I have a particularly complex use-case where multiple resources are to be locked together for the purpose of a group-related operation, so I'm looking at many different options. Postgres seems like a safe bet, but if I can take advantage of Kafka transactions that would be great.

Thanks for the article!

Martin Kleppmann • 6 years ago

Not knowing all the details of your requirements it's hard to give an unequivocal answer, but my guess is that Kafka transactions are irrelevant here. The main feature of Kafka transactions is to allow you to atomically publish messages to several topic-partitions. I don't see how that would help you implement a distributed lock.

I am not sure Kafka is really the right tool for this particular purpose, although if you really want to use it, then I would suggest routing all lock requests through a single topic-partition. The monotonically increasing offset that Kafka assigns to messages in a partition could then be used for fencing.
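For illustration, a sketch of that offset-as-fencing-token idea with kafka-python (topic name and payload are invented, and the topic must have a single partition so that offsets are totally ordered):

```python
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

def acquire_fencing_token() -> int:
    # Publish a lock request to the single-partition topic; the offset Kafka
    # assigns to the message is monotonically increasing and can be used as
    # the fencing token for subsequent writes to the protected resource.
    future = producer.send("lock-requests", b"acquire")
    metadata = future.get(timeout=10)
    return metadata.offset
```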

climbsocial • 6 years ago

Thanks Martin! The last part you mentioned is along the lines of what I was thinking: Aggregates would each map to a single topic-partition. A client may then attempt to lock (for correctness) one or more aggregates in a transaction. The assumption is that having "exactly-once" guarantees would help with updating these aggregates together, for example updating their "lockedBy", "lockedAt" properties.

Rafael Werlang • 2 years ago

It's an old post, but it is nonetheless particularly relevant today with the wide adoption of blockchain systems.
As stated, due to the fundamental asynchronicity of communication among peers, a cascade of problems arises once we introduce lock leases that expire.
However, if we remove the expiry from lock leases, we run into the problem of a peer that successfully acquired a lock (by majority vote) never releasing the lease and holding onto the lock forever (whether through malicious intent, network problems, etc.).
On the other hand, we may solve the problem of a lock being held indefinitely by having the majority vote for a forced release of the lock.
If the majority has already voted for the forced release of a lock, and the peer that held the lock comes back online and tries to write or release the lock after the forced release, the operation will simply be rejected by the majority, and the node that used to own the lock will revert any operations performed in the context of its acquired lock.
In that sense, the lock manager itself is distributed among the peers, which by majority vote perform three distinct operations: grant lock, commit + lock release, revert + forced release.

Valentin Kovalenko • 5 years ago

Hi Martin (Martin Kleppmann), it seems to me that, in order for the storage system protected by a distributed lock manager (DLM) to be able to participate in fencing, it has to be able to resolve concurrent requests on its own, because it must support the following operation (yes, the whole thing is a single operation, essentially compare-and-set (CAS)):

read the last-seen-fencing-number and compare it with the incoming-fencing-number; store the incoming-fencing-number if it is greater than the last-seen-fencing-number, or reject the request otherwise.

If a storage system can’t do such an operation, then the system can’t benefit from fencing. But if the system can do this, then such a system is able to resolve concurrent requests on its own because it has CAS, and the DLM becomes completely redundant for correctness purposes for such a system. Am I crazy, or am I correct? :)

It would be very cool if we could clarify this.
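For concreteness, a minimal sketch of that single operation as it might look inside the storage service (illustrative only; a real system must make the check-and-store atomic, here simulated with a local mutex):

```python
import threading

class TokenCasStorage:
    def __init__(self):
        self._mutex = threading.Lock()            # makes the check-and-store atomic
        self.last_seen_fencing_number = 0
        self.value = None

    def write(self, incoming_fencing_number: int, value) -> bool:
        with self._mutex:
            if incoming_fencing_number <= self.last_seen_fencing_number:
                return False                      # stale fencing number: reject
            self.last_seen_fencing_number = incoming_fencing_number
            self.value = value
            return True
```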

idelvall • 7 years ago

Great post! But I think a clarification has to be made regarding fencing tokens: What happens in the (unlikely) case that client 2 also suffers a STW pause and they write with increasing successive tokens?

Martin Kleppmann • 7 years ago

The storage system simply maintains the ‘ratchet’ that the token can only stay the same or increase, but not decrease. Thus, if client 2 pauses and client 3 acquires the lock, client 2 will have a lesser token. If client 3 has already made a request to the storage service, clients 1 and 2 will both be blocked.

idelvall • 7 years ago

what I mean is:
client 1 acquires lock
client 1 stops
client 1 lock expires
client 2 acquires lock
client 2 stops
client 1 resumes and writes to storage
client 2 resumes and writes to storage

This is more unlikely than the scenario you presented, but still possible, and breaks the desired "correctness" since both writes are accepted.

Martin Kleppmann • 7 years ago

Oh, I see what you mean now. Yes, you have a good point — that scenario does look risky. To reason about it properly, I think we would need to make some assumptions about the semantics of the write that occurs under the lock; I think it would probably turn out to be safe in some cases, but unsafe in others. I will think about it some more.

In consensus algorithms that use epochs/ballot numbers/round numbers (which have a similar function to the fencing token), the algorithm works because the type of write is constrained. Thus, Paxos for example can maintain the invariant that if one node decides x, no other node will decide a value other than x, even in the presence of arbitrary pauses. If unconstrained writes were allowed, the safety property could be violated.

Perhaps it would be useful to regard a storage system with fencing token support as participating in an approximation of a consensus algorithm, but further protocol constraints would be required to make it safe in all circumstances?

idelvall • 7 years ago

After some thinking, this is how I would do it:

In lock manager:
- If, since the last released lock, no other (later) lock has expired, then the next returned token is an "ordinary token" (incrementing the previous one)
- Otherwise, the next returned token is a "paired token" containing major/minor information, where major is the current token number and minor is the number of the first token not yet released at this time

In lock-aware resources:
- Keep record of the highest accepted token
- If the current token is ordinary, then behave as usual (rejecting it when it is not greater than the highest)
- If the current token is paired (granted after some others expired), then accept it only if its minor number is greater than the highest known

This would be consistent with my previous example:
-client 1 acquires lock (token: "1")
-client 1 stops
-client 1 lock expires
-client 2 acquires lock (token "2:1", meaning "lock 2 given after 1 expired")
-client 2 stops
-client 1 resumes and writes to storage
-storage accepts token and sets the highest known to "1"
-client 2 resumes and writes to storage
-storage rejects token "2:1" since "1" is not greater than highest ("1")

What do you think?
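A rough sketch of the resource-side rule, with an invented token encoding (major, minor); as the replies below show, there are still scenarios where this breaks:

```python
from typing import Optional, Tuple

# An ordinary token is (major, None); a paired token is (major, minor), where
# minor is the number of the first token that had not been released when this
# one was granted.
Token = Tuple[int, Optional[int]]

class Resource:
    def __init__(self):
        self.highest_accepted = 0

    def accept(self, token: Token) -> bool:
        major, minor = token
        if minor is None:
            ok = major > self.highest_accepted    # ordinary case: usual ratchet
        else:
            ok = minor > self.highest_accepted    # paired case: compare the minor part
        if ok:
            self.highest_accepted = major
        return ok

# The example above: client 1 writes with "1", then client 2's "2:1" is rejected.
res = Resource()
assert res.accept((1, None))
assert not res.accept((2, 1))
```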

cquliaoli • 4 years ago

I think your method does not work:
-client 1 acquires lock (token: "1")
-client 1 stops
-client 1 lock expires
-client 2 acquires lock (token "2:1", meaning "lock 2 given after 1 expired")
-client 2 does not write to storage
-client 2 releases the lock
-client 3 acquires lock (token: "3")
-client 3 stops
-client 3 lock expires
-client 1 resumes and writes to storage
-storage accepts token and sets the highest known to "1"
-client 3 resumes and writes to storage
-storage accepts token

Valentin Kovalenko • 5 years ago

This approach means that the storage would reject all requests after the expiration of the lock (token: 1), except for client 1's requests, until client 1 explicitly goes and releases the expired lock (token: 1). And this would defeat the lock expiration idea: even though a lock can expire, the whole system (lock management + the storage) still has to wait for it to be released if the owner of the lock made at least one request to the storage.

idelvall • 5 years ago

No, this approach doesn't mean that.

First, let's clarify that this idea was made in the context of fencing tokens (locking without making any timing assumptions, accepting a token only based on its value and the value of the previously accepted one).

Once the first lock expires, the locking system would give the lock to client 2 (token 2), but whichever client reaches the write operation first would succeed. If 2 comes first, 1 is disabled automatically by the ordinary case (monotonically increasing verification), but if 1 comes first, 2 would be disabled by the new "paired" case.

Martin Kleppmann • 7 years ago

Interesting idea. Seems plausible at first glance, but it's the kind of subtle protocol that would benefit from a formal proof of correctness. In these distributed systems topics it's terribly easy to accidentally overlook some edge-case.

Jeff Jeff • 7 years ago

This actually looks a lot like the algorithm proposed in the paper "On Optimistic Methods for Concurrency Control" by Kung and Robinson, 1981 (http://www.eecs.harvard.edu...

I believe that this paper addresses the exact issues that idelvall mentioned, and it also includes a formal proof. Additionally, in the proposed case, as long as client 1 and client 2 have no conflicts in what they are writing, both would still be permitted; if there were a conflict, however, client 2 would be rejected before starting its write. I would be interested in hearing your thoughts on this, though.

Adrian Petre • 5 years ago

Jeff Jeff
Hi, the link is broken.

idelvall • 7 years ago

Agreed, it was just an idea. I'll take a look at the Chubby paper and see how they handle this.

xingdl2007 • 3 years ago

Chubby solved this problem by introducing a sequencer.

Quote from 2.4:
At any time, a lock holder may request a sequencer, an opaque byte-string that describes the state of the lock immediately after acquisition. It contains the name of the lock, the mode in which it was acquired (exclusive or shared), and the lock generation number. The client passes the sequencer to servers (such as file servers) if it expects the operation to be protected by the lock. The recipient server is expected to test whether the sequencer is still valid and has the appropriate mode; if not, it should reject the request. The validity of a sequencer can be checked against the server's Chubby cache or, if the server does not wish to maintain a session with Chubby, against the most recent sequencer that the server has observed.

In my opinion, it is only safe to check against the server's Chubby cache; otherwise it suffers from the same problem.
The Chubby paper also mentions another, imperfect solution, lock-delay: if a lock becomes free because the holder has failed or become inaccessible, the lock server will prevent other clients from claiming the lock for a lock-delay period (1 minute).

antirez • 8 years ago

Note for readers: there is an error in the way the Redlock algorithm is used in the blog post. The final step, after the majority is acquired, is to check whether the total time elapsed is already over the lock TTL; in that case the client does not consider the lock valid. This makes Redlock immune to client <-> lock-server delays in the messages, and makes every other delay *after* the lock validity is tested just like any other GC pause during the processing of the locked resource. This is also equivalent to what happens, when using a remote lock server, if the "OK, you have the lock" reply from the server remains in the kernel buffers because the socket pauses before reading it. So where the blog post assumes that network delays or GC pauses during the lock acquisition stage are a problem, there is an error.

Martin Kleppmann • 8 years ago

This is correct, I had overlooked that additional clock check after messages are received. However, I believe that the additional check does not substantially alter the properties of the algorithm:

- Large network delay between the application and the shared resource (the thing that is protected by the lock) can still cause the resource to receive a message from an application process that no longer holds the lock, so fencing is still required.

- A GC pause between the final clock check and the resource access will not be caught by the clock check. As I point out in the article: "Remember that GC can pause a running thread at any point, including the point that is maximally inconvenient for you".

- All the dependencies on accuracy of clock measurement still hold.

antirez • 8 years ago

Hello Martin, thanks for your reply. Network delays between the app and the shared resource, and a GC pause *after* the check but before doing the actual work, are all conceptually exactly the same as "point 1" of your argument: GC pauses (or other pauses) make the algorithm require an incrementing token. So, regarding the safety of the algorithm itself, the only remaining issue is the dependency on clock drift, which can be argued either way depending on your point of view. So I'm sorry to have to say that IMHO the current version of the article, by showing the wrong implementation of the algorithm and not citing the equivalence of GC pauses while processing the shared resource with GC pauses immediately after the token is acquired, does not paint a fair picture.

MarutSingh • 7 years ago

The bottom line is that, for an application programmer like me, this implementation looks doubtful enough not to use in production systems. If it does not work perfectly, it can introduce bugs that will be impossible to fix.

李涛 • 6 years ago

Today I learned and thought about Redlock. I very much agree with you!

Raviraj • 5 months ago

Why the heck is the title "how to do distributed locking" when the whole article is about how Redlock is not a good choice?

Kiran Biliyawala • 1 year ago

My use case is to update multiple microservices, each with a different store of its own. This is for correctness purposes, so consistency and reliability are the utmost priority. And because multiple datastore/entity updates are involved, a fencing token doesn't solve the problem.
I'm considering using a relational store such as MySQL, or DynamoDB, with a lock entity. However, handling rollbacks etc. on failure brings its own complexity.
Do you recommend a better approach for such a scenario?

Sarfaraz Nawaz • 1 year ago

Instead of Redlock returning a fencing counter, what if the counter were generated/incremented/managed on the clients or on the storage itself? For example, a client reads the data and the associated counter, changes the data, and attempts to write back the changed data along with the counter, AND the write succeeds ONLY IF the counter received from the client is the same as the counter stored on the storage. We only need to make sure that every successful write also increments the counter, which invalidates all previously handed-out counters.
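That is essentially optimistic concurrency control; a minimal sketch of the storage side (illustrative names, and the compare-and-increment must be atomic in a real system):

```python
class VersionedStorage:
    def __init__(self, data=None):
        self.data = data
        self.counter = 0

    def read(self):
        return self.data, self.counter

    def write(self, new_data, counter_from_client: int) -> bool:
        if counter_from_client != self.counter:
            return False            # someone else wrote since the client's read
        self.data = new_data
        self.counter += 1           # invalidates every previously handed-out counter
        return True

store = VersionedStorage("v1")
data, counter = store.read()
assert store.write("v2", counter)       # first write-back succeeds
assert not store.write("v3", counter)   # stale counter: rejected
```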

Ted • 2 years ago

I wrote this on the counter-analysis, and wanted to post it here too:

I am no algorithm expert, but it seems to me that Redlock could use polling to keep the lock up to date, as long as the process has the lock and is running. As long as that is true, the timestamp could be updated, and the expiry (in Redis) could be updated on the key, and thus it could be determined whether the holder is alive or not. Every time Redlock updates the lock key in Redis, it updates the lock's expiry time to 2*pollingInterval (or something like that). No deadlocks, because if the process dies or loses network connectivity, the key will expire shortly thereafter.

Yes, it adds overhead and network traffic, but then the "expiryTime" would hardly be necessary, unless you want to set a hard upper limit.
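A rough sketch of that keep-alive loop with redis-py (key name and intervals invented; this only guards against deadlock from a dead holder, and does not address the fencing issues discussed in the post):

```python
import threading
import redis

r = redis.Redis()
POLL_INTERVAL_MS = 5000
LOCK_TTL_MS = 2 * POLL_INTERVAL_MS      # expiry is always 2x the polling interval

def acquire(key: str, holder_id: str) -> bool:
    # NX: only succeeds if no one else holds the key; PX: expire automatically.
    return bool(r.set(key, holder_id, nx=True, px=LOCK_TTL_MS))

def keep_alive(key: str, stop: threading.Event) -> None:
    # While the holder is running, keep pushing the expiry forward. If the
    # process dies or loses connectivity, the key expires shortly afterwards.
    while not stop.wait(POLL_INTERVAL_MS / 1000):
        r.pexpire(key, LOCK_TTL_MS)
```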

DN • 3 years ago

Hi Martin Kleppmann, thanks for an excellent primer on the pitfalls of distributed locking.
As an application developer who was thus far unaware of such subtleties of distributed locking: I have often relied on Postgres row locks (which are really a lease, when you take the idle client session timeout into account) to implement mutual exclusion, for "efficiency" use cases for the most part but also occasionally for "correctness". I am not so sure anymore that this was such a good idea.
I wonder if you have insight into the implementation of Postgres row locks, and whether they can serve as a good (enough) drop-in replacement for some of your recommended systems (i.e. ZooKeeper) for the "correctness" use case?
I have mainly used SELECT ... FOR UPDATE for mutual exclusion, in addition to sometimes using SELECT ... FOR UPDATE SKIP LOCKED for implementing work-stealing schedulers. I am assuming the same semantics extend to Postgres advisory locks too.
At first glance it seems like fencing should be built in, considering all Postgres locks are issued in transactional scope, although I imagine a lot of that depends on the configured transaction isolation level.
Additionally, considering that it's a relational database with master-slave replication, my gut feeling is that this should imply an implicit consensus, but I would really appreciate it if you could spare a thought on this topic.

Martin Kleppmann • 3 years ago

It depends what data/resource is being protected by the lock. If that resource is in Postgres itself (in the same Postgres instance), then it seems likely that the lock in Postgres is sufficient for protecting that resource from concurrent access. (You may need to dial the isolation level up to serializable to be sure, depending on the type of data access you're doing.)

However, if the resource you're protecting is outside of Postgres, you're back in the situation of needing a fencing token. I don't think that a SELECT ... FOR UPDATE lock returns any value that could be used for fencing, but you can probably construct a token generator yourself: put an integer counter value in the row that you're locking with SELECT ... FOR UPDATE, and increment it every time you acquire the lock.

One complicating detail is that if a client's connection to the database is interrupted, the transaction held by that client will eventually time out and be aborted, so any counter increments made within that transaction will be discarded. That would allow multiple clients to potentially obtain the same counter value, which would violate the requirements on the fencing token. You'd need to find some way of incrementing the counter that is not rolled back on failure. Maybe a PostgreSQL sequence object would be suitable, since IIRC its increments don't get rolled back on abort.
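A sketch of how that might look with psycopg2 (table, sequence, and connection string are invented; the row lock itself only lasts until the transaction ends, and the fencing token is what protects the external resource afterwards):

```python
import psycopg2

conn = psycopg2.connect("dbname=app")   # illustrative connection string

def acquire_lock_and_token(resource_id: int) -> int:
    with conn:                           # run as a single transaction
        with conn.cursor() as cur:
            # Take the row lock; this blocks until any concurrent holder's
            # transaction has finished.
            cur.execute("SELECT id FROM resources WHERE id = %s FOR UPDATE",
                        (resource_id,))
            # Draw the fencing token from a sequence: nextval() is not rolled
            # back on abort, so two clients can never see the same value.
            cur.execute("SELECT nextval('fencing_token_seq')")
            return cur.fetchone()[0]
```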

If you're doing failover on Postgres node failure, you need to make sure that the new leader is one that was synchronously replicated from the old leader, so that the new leader has seen all of the increments performed by the old leader.

Thomas Dean • 1 year ago

Martin Kleppmann, I am just wondering: why not simply implement ETags and remove the requirement for a lock?

Yegor Bugayenko • 4 years ago

You may also find www.stateful.co interesting (I'm a developer), which provides free hosted locks for distributed applications. I'm using it in a number of multi-server apps.

Ben • 5 years ago

Can someone explain to me how distributed locks work at all? Let's assume that two separate people on the internet ask two separate load-balanced web servers to solve the same problem at the exact same millisecond. Well, both web servers ask for a lock at the exact same millisecond. With latency, how can you be certain the lock will be granted correctly to only one of them?

if A asks for a lock and redis sees there is no lock set
and B asks for a lock and redis sees there is no lock set
wouldn't it try to give both of them the lock?

I guess this is where the delay comes in: force a delayed response to ensure you only give the lock to one of them?

Sorry if this sounds noobish, but I'd love to understand.

I have been solving concurrency problems by pushing all the actions I want performed into a queue and having a single consumer go through them in order. It works pretty well, but obviously it is relatively slow.

Adrian Petre • 5 years ago

Please respond to the comment http://disq.us/p/1w33vj3 on "Is Redlock safe?" (http://antirez.com/news/101).

Thank you!