I think the lock service and the storage have to be combined into a single service to resolve this tricky race condition.
Another workaround is for the client to send back the hash of the whole original file, before its modifications, to the storage system. The storage system then compares that hash with the hash of the current file to determine whether the file has been modified between the client reading it and sending back a modified version.
The storage service is already validating the token somehow with the lock service, so on top of that, additionally checking the TTL of the current lock shouldn't be too hard. The latency is already there.
> Another workaround is for the client to send back the hash of the whole original file, before its modifications, to the storage system.
Or simply a (row) version, as in an optimistic concurrency check.
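For illustration, a minimal sketch of such a version-based check, using SQLite; the `files` table and column names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect("storage.db")  # hypothetical database file

def save_if_unmodified(file_id, new_content, version_read):
    # The UPDATE matches only if the version is still the one the client read,
    # so a concurrent writer that already bumped the version makes this a no-op.
    cur = conn.execute(
        "UPDATE files SET content = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_content, file_id, version_read),
    )
    conn.commit()
    return cur.rowcount == 1  # False: the file was modified since it was read
```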
Hi Neil, good question. An obvious thing you could try is that the storage service can make a request to the lock service to check whether a token is still current, before accepting a write, although that will add latency. Other than that I don't really have a good answer. It might be a fundamental limitation because you have a race condition between the two clients, and in the absence of further information (such as a request to the lock service, or assumptions about clock synchronisation) the storage service cannot know whether a lease is still valid.
Actually, validating the access token before the storage accepts the request is still not 100% reliable. Consider the case where the lock happens to expire and is then acquired by another client just after the validation passes. Due to this timing issue, the storage can then take concurrent writes from two clients at the same time.
This is because "validate the token" and "take the write" are not a single atomic operation and don't happen instantaneously.
I want to answer the two questions I asked earlier now that I have a better understanding of distributed systems.
"I'm interested in enforcing that the client with the expired lease is not allowed to write to storage at all even if it writes before the client holding the lease.
1. Are there any algorithms to enforce this?"
If you need to enforce this for correctness then you should use your database to do it. What you want is to transactionally acquire a lock and write to the database and then release the lock. Databases already have this functionality built in! Putting the lock in another service makes everything harder because of the problems outlined in this blog post.
I would ask why you have this requirement. Is it really about the client, or about some other property you want to enforce? It is tempting to lock more than you need to enforce a constraint, but what is the minimum you need to lock? Can you structure your database transaction and table(s) such that the database enforces this in an efficient manner? The answer is almost always yes.
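As a rough sketch of that idea (acquiring the lock and doing the write inside one database transaction), here is what it might look like with PostgreSQL and psycopg2; the `accounts` table and connection string are placeholders:

```python
import psycopg2

conn = psycopg2.connect("dbname=app")  # placeholder connection string

def update_balance(account_id, new_balance):
    with conn:                       # transaction: commit on success, rollback on error
        with conn.cursor() as cur:
            # The row lock is taken inside the transaction and released at commit,
            # so no separate lock service is needed for this kind of constraint.
            cur.execute(
                "SELECT balance FROM accounts WHERE id = %s FOR UPDATE",
                (account_id,),
            )
            cur.execute(
                "UPDATE accounts SET balance = %s WHERE id = %s",
                (new_balance, account_id),
            )
```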
"2. Are there any libraries or storage solutions that implement such an algorithm?"
I would recommend you read about the locking strategy your database uses. The database already acquires locks by itself when you use it. It does a good job of locking only what is needed:
Postgres: https://www.postgresql.org/docs/7.1/locking-tables.html
MySQL: https://dev.mysql.com/doc/refman/8.4/en/innodb-locking.html
What if the lock service also responds with the lock start time and expiration time to client 1? Client 1 in turn sends them to the storage system, and the storage system just validates whether the expiry time is in the future or not, assuming the absence of clock synchronization issues.
That only works if the lock service and the storage system have perfectly synchronised clocks. However, clock sync typically runs over NTP, which is subject to the same network delays as all other packets, and that delay causes the clocks to be slightly out of sync with each other. Even if the clock drift is on the order of milliseconds, that's not close enough for this to be safe.
"Instead, please use a proper consensus system such as ZooKeeper, probably."
I literally skimmed the entire article for this part. :-)
That said, it's a good writeup, and very informative. Thanks for sharing!
Hello.
But wouldn't this sort of fencing token scheme imply that the storage has a critical section where it can compare the current token with the last one it received from the clients? This critical section alone seems enough to avoid conflicts. Also, with this approach, the storage could simply store object versions, and reject operations that didn't specify the most recent version.
What I'm saying is, it sure looks like if it's possible to implement fenced updates in the storage at all, then you don't need a separate lock service anymore. What am I missing?
Well observed. I didn't want to go into too much detail in this already long post. There are situations in which a fencing token is still useful. For example, if there are several replicas of the storage service, then each replica can independently check the freshness of the fencing token without coordinating with the other replicas. If a quorum (e.g. 2 out of 3) of replicas accept a fencing token as valid, then a write is reported as successful to the client.
This situation is actually very similar to the second phase of Paxos or Raft (if you replace "fencing token" with "term number"), where a leader tries to replicate a value to a quorum of nodes. The first phase of Paxos/Raft is the leader election, which corresponds closely to acquiring a distributed lock. Consensus algorithms ensure that there is at most one leader for a given term number, and that term numbers monotonically increase over time. Being the leader in a particular term number corresponds to acquiring the lock with a particular fencing token.
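To make that concrete, here is a toy sketch in which in-process objects stand in for storage replicas; the quorum rule and token handling are simplified assumptions:

```python
class Replica:
    def __init__(self):
        self.highest_token = -1
        self.value = None

    def write(self, token, value):
        # Each replica independently enforces that tokens never go backwards.
        if token <= self.highest_token:
            return False
        self.highest_token = token
        self.value = value
        return True

def quorum_write(replicas, token, value):
    acks = sum(1 for r in replicas if r.write(token, value))
    return acks >= len(replicas) // 2 + 1  # e.g. 2 out of 3 must accept

replicas = [Replica(), Replica(), Replica()]
assert quorum_write(replicas, token=33, value="a")
assert not quorum_write(replicas, token=33, value="b")  # stale token rejected
```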
Thanks, Martin! Makes sense, but it also means that while trying to come up with a distributed lock managing algorithm we ended up having nothing but another consensus algorithm:)
Yes, this is exactly what I expressed in the comments here about 2 years ago. Well, I am glad I am not the only one who sees it this way :)
After reading the article, one conclusion can be derived: it is specific to Redlock, which has been implemented in the Redisson library. But I would like to implement a lock from scratch using incrementing atomic counters in Redis, without setting a timeout.
Implementation would look like:
getLock(key): increment the key by 1; if the value I get back is 1 then the lock is acquired, otherwise someone has already taken the lock.
releaseLock(key): delete the key.
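A minimal sketch of that scheme using redis-py (whether it is safe is discussed in the reply below):

```python
import redis

r = redis.Redis()  # assumes a local Redis instance

def get_lock(key):
    # The first caller to increment the key sees 1 and wins;
    # everyone else sees a value greater than 1.
    return r.incr(key) == 1

def release_lock(key):
    r.delete(key)
```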
is this implementation safe enough for correctness?
Not having a lock timeout means that any process that grabs the lock and becomes unavailable makes the whole LM (lock manager) unusable. So if you do not have a timeout, then there is no sense in a DLM (distributed LM) at all (though there is no sense in a DLM anyway...). Thus if you simply need an LM, then using PostgreSQL advisory session locks makes more sense than doing it with Redis, because PostgreSQL session locks are at least auto-unlocked if the holding session disconnects, so an unavailable process holding the lock will not render a PostgreSQL-based LM unusable.
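For reference, a minimal sketch of a PostgreSQL session-level advisory lock via psycopg2; the lock key, connection string and `do_protected_work` are placeholders. If the session disconnects, PostgreSQL releases the lock automatically:

```python
import psycopg2

conn = psycopg2.connect("dbname=app")  # placeholder connection string
cur = conn.cursor()

LOCK_KEY = 42  # arbitrary application-chosen lock identifier

cur.execute("SELECT pg_advisory_lock(%s)", (LOCK_KEY,))  # blocks until acquired
try:
    do_protected_work()  # hypothetical critical section
finally:
    cur.execute("SELECT pg_advisory_unlock(%s)", (LOCK_KEY,))
    conn.commit()
# If the process crashes instead, the lock is released when the session ends.
```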
Hi Martin, now that Kafka 0.11 has support for transactions, could we use it in some way, shape or form for distributed locks? Perhaps using an event-sourced pattern where an aggregate tracks who is locking it?
I have a particularly complex use-case where multiple resources are to be locked together for the purpose of a group-related operation, so I'm looking at many different options. Postgres seems like a safe bet, but if I can take advantage of Kafka transactions that would be great.
Thanks for the article!
Not knowing all the details of your requirements it's hard to give an unequivocal answer, but my guess is that Kafka transactions are irrelevant here. The main feature of Kafka transactions is to allow you to atomically publish messages to several topic-partitions. I don't see how that would help you implement a distributed lock.
I am not sure Kafka is really the right tool for this particular purpose, although if you really want to use it, then I would suggest routing all lock requests through a single topic-partition. The monotonically increasing offset that Kafka assigns to messages in a partition could then be used for fencing.
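To illustrate that suggestion, a sketch using kafka-python, where the offset assigned to a lock-request message acts as the fencing token; the topic name is an assumption, and the topic is assumed to have a single partition:

```python
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

def request_lock(resource_id):
    # All lock requests go through one topic-partition, so the broker assigns
    # offsets in a single, monotonically increasing sequence.
    future = producer.send("lock-requests", key=resource_id.encode(), value=b"acquire")
    metadata = future.get(timeout=10)
    return metadata.offset  # use this offset as the fencing token for later writes
```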
Thanks Martin! The last part you mentioned is along the lines of what I was thinking: Aggregates would each map to a single topic-partition. A client may then attempt to lock (for correctness) one or more aggregates in a transaction. The assumption is that having "exactly-once" guarantees would help with updating these aggregates together, for example updating their "lockedBy", "lockedAt" properties.
It's an old post, nonetheless it is particularly relevant today with the wide adoption of blockchain systems.
As stated, due to the fundamental asynchronicity of communication among peers, a cascade of problems arises when lock leases that expire are introduced.
However, if we remove the expiring function of lock leases we run into the problem of a peer that successfully acquired a lock (by majority voting) never releasing the lease and holding onto the lock forever (whether by malicious intent, network problems, etc).
On the other hand, we may solve the problem of a lock being held indefinitely by making the majority vote for a forced release of the lock.
If the majority has already voted for the forced release of a lock, and the peer that held the lock comes online and tries to write or release the lock after the forced release, the operation will simply be rejected by the majority, and the node that used to own the lock will revert any operations performed in the context of its acquired lock.
In that sense, the lock manager itself is distributed among the peers that by majority voting perform three distinct operations: grant lock, commit + lock release, revert + forced release.
Hi Martin, it seems to me that in order for a storage system which is protected by a distributed lock manager (DLM) to be able to participate in fencing, it has to be able to resolve concurrent requests on its own, because it must support the following operation (yes, the whole thing is a single operation, essentially compare-and-set (CAS)):
read the last-seen fencing number and compare it with the incoming fencing number; store the incoming fencing number if it is greater than the last-seen one, otherwise reject the request.
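A minimal sketch of that single operation as it might look inside a (single-node) storage system, using an in-process mutex to make the check-and-store atomic:

```python
import threading

class FencedStore:
    def __init__(self):
        self._mutex = threading.Lock()
        self._last_seen_token = -1
        self._data = {}

    def write(self, token, key, value):
        # Compare the incoming token with the last seen one and either advance
        # it and accept the write, or reject the request; the mutex makes the
        # whole step atomic.
        with self._mutex:
            if token <= self._last_seen_token:
                return False  # stale fencing token: reject
            self._last_seen_token = token
            self._data[key] = value
            return True
```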
If a storage system can’t do such an operation, then the system can’t benefit from fencing. But if the system can do this, then such a system is able to resolve concurrent requests on its own because it has CAS, and the DLM becomes completely redundant for correctness purposes for such a system. Am I crazy, or am I correct? :)
It would be very cool if we could clarify this.
Sorry for the bump.
Main use-case I can envisage: "best-effort transaction" support, over subsets of linearizable stores (plural!) that have CAS but not transactions.
During client x's lease, x could issue (potentially multiple) writes to (potentially multiple) linearizable stores participating in the arrangement. x could even extend their lease for a long-running "pseudo-transaction", if the network plays nice.
Atomicity is the kicker, though: x might need to live with only its first i of n writes executing with serial isolation, before its lease expires (say it's unable to renew it due to packet delay) - unless the next client y is forced to begin with a compensating transaction...
Note that another commenter brought the same exact point I did (search for "What I'm saying is, it sure looks like if it's possible to implement fenced updates in the storage at all, then you don't need a separate lock service anymore."), and even got a reply from Martin.
Great post! But I think a clarification has to be made regarding fencing tokens: What happens in the (unlikely) case that client 2 also suffers a STW pause and they write with increasing successive tokens?
The storage system simply maintains the "ratchet" that the token can only stay the same or increase, but not decrease. Thus, if client 2 pauses and client 3 acquires the lock, client 2 will have a lesser token. If client 3 has already made a request to the storage service, clients 1 and 2 will both be blocked.
what I mean is:
client 1 acquires lock
client 1 stops
client 1 lock expires
client 2 acquires lock
client 2 stops
client 1 resumes and writes to storage
client 2 resumes and writes to storage
This is more unlikely than the scenario you presented, but still possible, and breaks the desired "correctness" since both writes are accepted.
Oh, I see what you mean now. Yes, you have a good point — that scenario does look risky. To reason about it properly, I think we would need to make some assumptions about the semantics of the write that occurs under the lock; I think it would probably turn out to be safe in some cases, but unsafe in others. I will think about it some more.
In consensus algorithms that use epochs/ballot numbers/round numbers (which have a similar function to the fencing token), the algorithm works because the type of write is constrained. Thus, Paxos for example can maintain the invariant that if one node decides x, no other node will decide a value other than x, even in the presence of arbitrary pauses. If unconstrained writes were allowed, the safety property could be violated.
Perhaps it would be useful to regard a storage system with fencing token support as participating in an approximation of a consensus algorithm, but further protocol constraints would be required to make it safe in all circumstances?
After some thinking, this is how I would do it:
In lock manager:
- If, since the last released lock, no other (later) lock has expired, then the next returned token is an "ordinary token" (incrementing the previous one)
- Otherwise, the next returned token is a "paired token" containing major/minor information: the major is the current token number, and the minor is the number of the earliest token not yet released at this time
In lock-aware resources:
- Keep record of the highest accepted token
- If the current token is ordinary then behave as usual (rejecting when it's not greater than the highest)
- If the current token is paired (granted after some other token's expiration) then accept it only if its minor number is greater than the highest known (see the sketch below)
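A minimal sketch of that acceptance rule as I understand it; representing a paired token as a (major, minor) tuple is my own assumption:

```python
class Resource:
    def __init__(self):
        self.highest = 0  # highest accepted token number so far

    def accept(self, token):
        # An ordinary token is an int; a paired token is a (major, minor) tuple,
        # where minor is the earliest token not yet released when it was granted.
        if isinstance(token, tuple):
            major, minor = token
            ok = minor > self.highest      # paired case
        else:
            major = token
            ok = token > self.highest      # ordinary case
        if ok:
            self.highest = major
        return ok
```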
This would be consistent with my previous example:
-client 1 acquires lock (token: "1")
-client 1 stops
-client 1 lock expires
-client 2 acquires lock (token "2:1", meaning "lock 2 given after 1 expired")
-client 2 stops
-client 1 resumes and writes to storage
-storage accepts token and sets the highest known to "1"
-client 2 resumes and writes to storage
-storage rejects token "2:1" since "1" is not greater than highest ("1")
what do you think?
I think your method does not work:
-client 1 acquires lock (token: "1")
-client 1 stops
-client 1 lock expires
-client 2 acquires lock (token "2:1", meaning "lock 2 given after 1 expired")
-client 2 do not write storage
-client 2 released lock
-client 3 acquires lock (token: "3")
-client 3 stops
-client 3 lock expires
-client 1 resumes and writes to storage
-storage accepts token and sets the highest known to "1"
-client 3 resumes and writes to storage
-storage accepts token
This approach means that the storage would reject all requests after the expiration of the lock (token 1), except for client 1's requests, until client 1 explicitly goes and releases the expired lock (token 1). And this would defeat the idea of lock expiration: even though a lock can expire, the whole system (lock management + the storage) still must wait for it to be released if the owner of the lock made at least one request to the storage.
No, this approach doesn't mean that.
First, let's clarify that this idea was made in the context of fencing tokens (locking without making any timing assumptions, accepting a token based only on its value and the value of the previously accepted one).
Once the first lock expires, the locking system gives the lock to client 2 (token 2), but whichever client reaches the write operation first succeeds. If 2 comes first, 1 is disabled automatically by the ordinary case (the monotonically increasing check), but if 1 comes first, 2 is disabled by the new "paired" case.
Interesting idea. Seems plausible at first glance, but it's the kind of subtle protocol that would benefit from a formal proof of correctness. In these distributed systems topics it's terribly easy to accidentally overlook some edge-case.
This actually looks a lot like the algorithm proposed in the paper "On Optimistic Methods for Concurrency Control" by Kung and Robinson, 1981 (http://www.eecs.harvard.edu...)
I believe that this paper addresses the exact issues that idelvall mentioned and also includes a formal proof. Additionally, in the case proposed as long as client 1 and client 2 have no conflicts in what they are writing then both would still be permitted, however in this case if there was a conflict then client 2 would be rejected prior to starting its write. Would be interested in hearing your thoughts on this though.
hi, the link is broken
Agreed, just an idea. I'll take a look into Chubby's paper and see how they handle this
Chubby solved this problem by introducing a sequencer.
Quote from 2.4:
At any time, a lock holder may request a sequencer, an opaque byte-string that describes the state of the lock immediately after acquisition. It contains the name of the lock, the mode in which it was acquired (exclusive or shared), and the lock generation number. The client passes the sequencer to servers (such as file servers) if it expects the operation to be protected by the lock. The recipient server is expected to test whether the sequencer is still valid and has the appropriate mode; if not, it should reject the request. The validity of a sequencer can be checked against the server's Chubby cache or, if the server does not wish to maintain a session with Chubby, against the most recent sequencer that the server has observed.
In my opinion, it is only safe to check against the server's Chubby cache, or it suffers the same problem.
And the Chubby paper mentions another imperfect solution, lock-delay: if a lock becomes free because the holder has failed or become inaccessible, the lock server will prevent other clients from claiming the lock for a lock-delay period (1 minute).
Note for the readers: there is an error in the way the Redlock algorithm is described in this blog post. The final step, after the majority is acquired, is to check whether the total time elapsed is already over the lock TTL, and in such a case the client does not consider the lock as valid. This makes Redlock immune to client <-> lock-server delays in the messages, and makes every other delay *after* the lock validity is tested just like any other GC pause during the processing of the locked resource. This is also equivalent to what happens, when using a remote lock server, if the "OK, you have the lock" reply from the server remains in the kernel buffers because the client pauses before reading it from the socket. So where this blog post assumes that network delays or GC pauses during the lock acquisition stage are a problem, there is an error.
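For concreteness, a sketch of the final validity check being described; `acquire_on_node` and `release_on_all` are hypothetical placeholders for the per-node Redis calls, and the drift allowance is simplified:

```python
import time

LOCK_TTL = 10.0      # seconds, chosen for illustration
CLOCK_DRIFT = 0.01   # allowance for clock drift between nodes (simplified)

def acquire_redlock(nodes, resource, value):
    start = time.monotonic()
    acks = sum(acquire_on_node(n, resource, value, LOCK_TTL) for n in nodes)
    elapsed = time.monotonic() - start
    validity = LOCK_TTL - elapsed - CLOCK_DRIFT
    # Final step: if acquiring the majority took so long that the TTL is
    # already (nearly) used up, the lock is not considered valid.
    if acks >= len(nodes) // 2 + 1 and validity > 0:
        return validity
    release_on_all(nodes, resource, value)
    return None
```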
This is correct, I had overlooked that additional clock check after messages are received. However, I believe that the additional check does not substantially alter the properties of the algorithm:
- Large network delay between the application and the shared resource (the thing that is protected by the lock) can still cause the resource to receive a message from an application process that no longer holds the lock, so fencing is still required.
- A GC pause between the final clock check and the resource access will not be caught by the clock check. As I point out in the article: "Remember that GC can pause a running thread at any point, including the point that is maximally inconvenient for you".
- All the dependencies on accuracy of clock measurement still hold.
Hello Martin, thanks for your reply. Network delays between the app and the shared resource, and a GC pause *after* the check but before doing the actual work, are all conceptually exactly the same as "point 1" of your argument: GC pauses (or other pauses) make the algorithm require an incrementing token. So, regarding the safety of the algorithm itself, the only remaining thing would be the dependency on clock drift, which can be argued about depending on your point of view. So I'm sorry to have to say that IMHO the current version of the article, by showing the wrong implementation of the algorithm, and by not mentioning the equivalence of GC pauses while processing the shared resource with GC pauses immediately after the token is acquired, does not provide a fair picture.
Bottom line: for an application programmer like me, this implementation looks doubtful enough not to use it in production systems. If it does not work perfectly, it can introduce bugs which will be impossible to fix.
Today I learned and thought about Redlock. I very much agree with you!
So basically you're proposing to use optimistic locking with a token instead of pessimistic locking.
The obvious question is then: why use a lock at all? It seems you could just use a token generator service + a storage service check and have same guarantees.
There is a typo in 'The man page for gettimeofday explicitly says that the time '
man -> main
manual
It is "man page" actually. https://man7.org/linux/man-pages/man2/gettimeofday.2.html
This is an interesting topic, incidentally, I just released an article about achieving distributed locking using Redis regular IO (read & set).
Please check it out here for details: https://www.linkedin.com/pulse/master-less-cluster-wide-resource-locking-gerardo-recinto-g4mec/?trackingId=%2FRvVnuf4TwirkXACC4TYCw%3D%3D
Enjoy! :)
Hi, Martin!
Thanks for the article. The fencing token approach is very helpful for correctness, but the main problem (for me) is that the storage has to support token validation. For example, if I have S3 and I want to store a file only from the leader, how could a vanilla S3 implementation check the correctness of a token? There's no conditional put in the API. It seems we need some kind of proxy on top of S3 which would handle put requests with tokens?
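Something like that proxy might look as follows; this is only a sketch, and it assumes a single proxy instance holding the fencing state in memory (the bucket name is a placeholder):

```python
import threading
import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"      # placeholder
_mutex = threading.Lock()
_last_token = -1

def fenced_put(token, key, body):
    global _last_token
    # Holding the mutex across the upload serialises writes through this proxy,
    # so a write with a stale fencing token cannot overtake a newer one here.
    with _mutex:
        if token <= _last_token:
            return False  # stale fencing token: reject the write
        _last_token = token
        s3.put_object(Bucket=BUCKET, Key=key, Body=body)
        return True
```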
Why the heck is the title "how to do distributed locking" when the whole article is about how Redlock is not a good choice?
My use case is to update multiple microservices, each with its own store. This is for correctness purposes, so consistency and reliability are the utmost priority. And because there are multiple datastore/entity updates involved, a fencing token doesn't solve the problem.
I'm considering using a relational store such as MySQL, or DynamoDB, with a lock entity. However, handling rollbacks etc. on failure brings its own complexity.
Do you recommend any better approach for such a scenario?
Hi Martin,
Thank you for your post.
The fencing algorithm would seem to allow Client 1 to write without holding the lease if it wrote with token 33 before Client 2 wrote with token 34 (I made a crude diagram below). I'm interested in enforcing that the client with the expired lease is not allowed to write to storage at all even if it writes before the client holding the lease.
1. Are there any algorithms to enforce this?
2. Are there any libraries or storage solutions that implement such an algorithm?
Thank you for your time.
https://uploads.disquscdn.c...