After reading the article, one conclusion can be drawn: it is specific to Redlock, as implemented in the Redisson library. But I would like to implement a lock from scratch using incrementing atomic counters in Redis, without setting a timeout.
The implementation would look like:
- getLock(key): increment the key by 1; if the value I get back is 1 then the lock is acquired, otherwise someone has already taken the lock.
- releaseLock(key): delete the key.

Is this implementation safe enough for correctness?
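A minimal sketch of that scheme, with an in-memory dict standing in for Redis (a real client would call INCR and DEL):

```python
class CounterLock:
    """Sketch of the proposed counter-based lock; a dict stands in for Redis.

    With a real Redis client this would be r.incr(key) / r.delete(key).
    Note that with no timeout, a crashed holder leaves the key stuck forever.
    """
    def __init__(self):
        self.counters = {}  # stands in for the Redis keyspace

    def get_lock(self, key):
        # INCR is atomic in Redis; only the caller that sees 1 holds the lock
        self.counters[key] = self.counters.get(key, 0) + 1
        return self.counters[key] == 1

    def release_lock(self, key):
        # DEL the key so the next INCR returns 1 again
        self.counters.pop(key, None)

lock = CounterLock()
assert lock.get_lock("job") is True    # first caller acquires
assert lock.get_lock("job") is False   # second caller is refused
lock.release_lock("job")
assert lock.get_lock("job") is True    # free again after release
```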
Not having a lock timeout means that any process that grabs the lock and becomes unavailable makes the whole LM (lock manager) unusable. So if you do not have a timeout, then there is no sense in a DLM (distributed LM) at all (though there is no sense in a DLM anyway...). Thus, if you simply need an LM, using PostgreSQL advisory session locks makes more sense than doing it with Redis, because PostgreSQL session locks are at least auto-released if the holding session disconnects, so an unavailable process holding the lock will not render a PostgreSQL-based LM unusable.
Hi Martin, now that Kafka 0.11 has support for transactions, could we use it in some way, shape or form for distributed locks? Perhaps using an event-sourced pattern where an aggregate tracks who is locking it?
I have a particularly complex use-case where multiple resources are to be locked together for the purpose of a group-related operation, so I'm looking at many different options. Postgres seems like a safe bet, but if I can take advantage of Kafka transactions that would be great.
Thanks for the article!
Not knowing all the details of your requirements it's hard to give an unequivocal answer, but my guess is that Kafka transactions are irrelevant here. The main feature of Kafka transactions is to allow you to atomically publish messages to several topic-partitions. I don't see how that would help you implement a distributed lock.
I am not sure Kafka is really the right tool for this particular purpose, although if you really want to use it, then I would suggest routing all lock requests through a single topic-partition. The monotonically increasing offset that Kafka assigns to messages in a partition could then be used for fencing.
Thanks Martin! The last part you mentioned is along the lines of what I was thinking: Aggregates would each map to a single topic-partition. A client may then attempt to lock (for correctness) one or more aggregates in a transaction. The assumption is that having "exactly-once" guarantees would help with updating these aggregates together, for example updating their "lockedBy", "lockedAt" properties.
Great post! But I think a clarification has to be made regarding fencing tokens: What happens in the (unlikely) case that client 2 also suffers a STW pause and they write with increasing successive tokens?
The storage system simply maintains the ‘ratchet’ that the token can only stay the same or increase, but not decrease. Thus, if client 2 pauses and client 3 acquires the lock, client 2 will have a lesser token. If client 3 has already made a request to the storage service, client 1 and 2 will both be blocked.
What I mean is:
- client 1 acquires lock
- client 1 stops
- client 1's lock expires
- client 2 acquires lock
- client 2 stops
- client 1 resumes and writes to storage
- client 2 resumes and writes to storage
This is more unlikely than the scenario you presented, but still possible, and breaks the desired "correctness" since both writes are accepted.
Oh, I see what you mean now. Yes, you have a good point — that scenario does look risky. To reason about it properly, I think we would need to make some assumptions about the semantics of the write that occurs under the lock; I think it would probably turn out to be safe in some cases, but unsafe in others. I will think about it some more.
In consensus algorithms that use epochs/ballot numbers/round numbers (which have a similar function to the fencing token), the algorithm works because the type of write is constrained. Thus, Paxos for example can maintain the invariant that if one node decides x, no other node will decide a value other than x, even in the presence of arbitrary pauses. If unconstrained writes were allowed, the safety property could be violated.
Perhaps it would be useful to regard a storage system with fencing token support as participating in an approximation of a consensus algorithm, but further protocol constraints would be required to make it safe in all circumstances?
After some thinking, this is how I would do it:
In the lock manager:
- If no other (later) lock has expired since the last released lock, the next returned token is an "ordinary token" (incrementing the previous one)
- Otherwise, the next returned token is a "paired token" containing major/minor information, where major is the current token number and minor is the number of the first token not released at this time

In lock-aware resources:
- Keep a record of the highest accepted token
- If the current token is ordinary, behave as usual (rejecting it when it's not greater than the highest)
- If the current token is paired (granted after some other lock's expiration), accept it only if its minor number is greater than the highest known
This would be consistent with my previous example:
- client 1 acquires lock (token "1")
- client 1 stops
- client 1's lock expires
- client 2 acquires lock (token "2:1", meaning "lock 2 given after 1 expired")
- client 2 stops
- client 1 resumes and writes to storage
- storage accepts the token and sets the highest known to "1"
- client 2 resumes and writes to storage
- storage rejects token "2:1" since "1" is not greater than the highest ("1")
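A sketch of the acceptance rule in that scheme, assuming ordinary tokens are plain integers and paired tokens are (major, minor) pairs:

```python
def accepts(highest, token):
    """Storage-side check for the proposed scheme (a sketch, not a proof).

    Ordinary tokens are ints; paired tokens are (major, minor) tuples,
    granted after some earlier lock expired. `highest` is the highest
    token number accepted by the storage so far.
    """
    if isinstance(token, tuple):
        # paired case: accept only if the first unreleased predecessor's
        # number has already been surpassed by an accepted write
        major, minor = token
        return minor > highest
    # ordinary case: plain monotonic check
    return token > highest

# replaying the example in the comment:
highest = 0
assert accepts(highest, 1)           # client 1 writes first with token "1"
highest = 1
assert not accepts(highest, (2, 1))  # client 2's "2:1" rejected: 1 is not > 1
```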
what do you think?
I think your method does not work:
- client 1 acquires lock (token "1")
- client 1 stops
- client 1's lock expires
- client 2 acquires lock (token "2:1", meaning "lock 2 given after 1 expired")
- client 2 does not write to storage
- client 2 releases the lock
- client 3 acquires lock (token "3")
- client 3 stops
- client 3's lock expires
- client 1 resumes and writes to storage
- storage accepts the token and sets the highest known to "1"
- client 3 resumes and writes to storage
- storage accepts the token
This approach means that the storage would reject all requests after the expiration of the lock (token 1), except for client 1's requests, until client 1 explicitly goes and releases the expired lock (token 1). And this would defeat the idea of lock expiration: although a lock can expire, the whole system (lock management + storage) still must wait for it to be released if the owner of the lock made at least one request to the storage.
No, this approach doesn't mean that.
First, let's clarify that this idea was proposed in the context of fencing tokens (locking without making any timing assumptions, accepting a token based only on its value and the value of the previously accepted one).
Once the first lock expires, the locking system gives the lock to client 2 (token 2), but whichever client arrives first at the write operation succeeds. If 2 comes first, 1 is disabled automatically by the ordinary case (monotonically increasing verification); but if 1 comes first, 2 is disabled by the new "paired" case.
Interesting idea. Seems plausible at first glance, but it's the kind of subtle protocol that would benefit from a formal proof of correctness. In these distributed systems topics it's terribly easy to accidentally overlook some edge-case.
This actually looks a lot like the algorithm proposed in the paper "On Optimistic Methods for Concurrency Control" by Kung and Robinson, 1981 (http://www.eecs.harvard.edu...
I believe that this paper addresses the exact issues that idelvall mentioned, and it also includes a formal proof. Additionally, in the case proposed, as long as client 1 and client 2 have no conflicts in what they are writing, both would still be permitted; however, if there were a conflict, client 2 would be rejected prior to starting its write. I'd be interested in hearing your thoughts on this, though.
Hi Jeff, the link is broken.
Agreed, just an idea. I'll take a look into Chubby's paper and see how they handle this
Chubby solved this problem by introducing sequencers.
Quote from section 2.4: "At any time, a lock holder may request a sequencer, an opaque byte-string that describes the state of the lock immediately after acquisition. It contains the name of the lock, the mode in which it was acquired (exclusive or shared), and the lock generation number. The client passes the sequencer to servers (such as file servers) if it expects the operation to be protected by the lock. The recipient server is expected to test whether the sequencer is still valid and has the appropriate mode; if not, it should reject the request. The validity of a sequencer can be checked against the server's Chubby cache or, if the server does not wish to maintain a session with Chubby, against the most recent sequencer that the server has observed."
In my opinion, it is only safe to check against the Chubby cache, or it suffers the same problem. The Chubby paper also mentions another imperfect solution, lock-delay: if a lock becomes free because the holder has failed or become inaccessible, the lock server will prevent other clients from claiming the lock for a lock-delay period (1 minute).
Note for the readers: there is an error in the way the Redlock algorithm is used in the blog post. The final step, after the majority is acquired, is to check if the total time elapsed is already over the lock TTL; in such a case the client does not consider the lock as valid. This makes Redlock immune to client <-> lock-server delays in the messages, and makes every other delay *after* the lock validity is tested like any other GC pause during the processing of the locked resource. This is also equivalent to what happens, when using a remote lock server, if the "OK, you have the lock" reply from the server remains in the kernel buffers because the socket pauses before reading it. So where this blog post assumes that network delays or GC pauses during the lock acquisition stage are a problem, there is an error.
This is correct, I had overlooked that additional clock check after messages are received. However, I believe that the additional check does not substantially alter the properties of the algorithm:
- Large network delay between the application and the shared resource (the thing that is protected by the lock) can still cause the resource to receive a message from an application process that no longer holds the lock, so fencing is still required.
- A GC pause between the final clock check and the resource access will not be caught by the clock check. As I point out in the article: "Remember that GC can pause a running thread at any point, including the point that is maximally inconvenient for you".
- All the dependencies on accuracy of clock measurement still hold.
Hello Martin, thanks for your reply. Network delays between the app and the shared resource, and a GC pause *after* the check, but before doing the actual work, are all conceptually exactly the same as the "point 1" of your argument, that is, GC pauses (or other pauses) make the algo require an incremental token. So, regarding the safety of the algorithm itself, the only remaining thing would be the dependency on clock drifts, that can be argued depending on point of view. So I'm sorry to have to say that IMHO the current version of the article, by showing the wrong implementation of the algorithm, and not citing the equivalence of GC pauses processing the shared resource, with GC pauses immediately after the token is acquired, does not provide a fair picture.
Bottom line: for an application programmer like me, this implementation looks doubtful enough not to use it in production systems. If it does not work perfectly, it can introduce bugs that will be impossible to fix.
"Instead, please use a proper consensus system such as ZooKeeper, probably."
I literally skimmed the entire article for this part. :-)
That said, it's a good writeup, and very informative. Thanks for sharing!
Today I learned and thought about Redlock. I very much agree with you!
It's an old post, but it is particularly relevant today with the wide distribution of blockchain systems. As it states, due to the fundamental asynchronicity of communication among peers, a cascade of problems arises when we introduce lock leases that expire. However, if we remove the expiry of lock leases, we run into the problem of a peer that successfully acquired a lock (by majority vote) never releasing the lease and holding onto the lock forever (whether through malicious intent, network problems, etc.). On the other hand, we may solve the problem of a lock being held indefinitely by having the majority vote for a forced release of the lock. If the majority has already voted for the forced release of a lock, and the peer that held the lock comes online and tries to write or release the lock after the forced release, the operation will simply be rejected by the majority, and the node that used to own the lock will revert any operations made in the context of its acquired lock. In that sense, the lock manager itself is distributed among the peers, which by majority vote perform three distinct operations: grant lock, commit + lock release, revert + forced release.
But wouldn't this sort of fencing token scheme imply that the storage has a critical section where it can compare the current token with the last one it received from the clients? This critical section alone seems enough to avoid conflicts. Also, with this approach, the storage could simply store object versions, and reject operations that didn't specify the most recent version.
What I'm saying is, it sure looks like if it's possible to implement fenced updates in the storage at all, then you don't need a separate lock service anymore. What am I missing?
Well observed. I didn't want to go into too much detail in this already long post. There are situations in which a fencing token is still useful. For example, if there are several replicas of the storage service, then each replica can independently check the freshness of the fencing token without coordinating with the other replicas. If a quorum (e.g. 2 out of 3) of replicas accept a fencing token as valid, then a write is reported as successful to the client.
This situation is actually very similar to the second phase of Paxos or Raft (if you replace "fencing token" with "term number"), where a leader tries to replicate a value to a quorum of nodes. The first phase of Paxos/Raft is the leader election, which corresponds closely to acquiring a distributed lock. Consensus algorithms ensure that there is at most one leader for a given term number, and that term numbers monotonically increase over time. Being the leader in a particular term number corresponds to acquiring the lock with a particular fencing token.
Thanks, Martin! Makes sense, but it also means that while trying to come up with a distributed lock managing algorithm we ended up having nothing but another consensus algorithm:)
Yes, this is exactly what I expressed in the comments here about 2 years ago. Well, I am glad I am not the only one who sees it this way :)
Hi Martin Kleppmann, thanks for an excellent primer on the pitfalls of distributed locking. As an application developer who was thus far unaware of such subtleties, I have often relied on Postgres row locks (which are really a lease when you take the idle client session timeout into account) to implement mutual exclusion, mostly for "efficiency" use cases but occasionally also for "correctness". I am not so sure anymore that this was such a good idea. I wonder if you have insight into the implementation of Postgres row locks, and whether they can serve as a good (enough) drop-in replacement for some of your recommended systems (i.e. ZooKeeper) for the "correctness" use case? I have mainly used SELECT ... FOR UPDATE for mutual exclusion, in addition to sometimes using SELECT ... FOR UPDATE SKIP LOCKED for implementing work-stealing schedulers. I am assuming the same semantics extend to Postgres advisory locks too. At first glance it seems like fencing should be built in, considering that all Postgres locks are issued in transactional scope, although I imagine a lot of that depends on the configured transaction isolation level. Additionally, considering that it's a relational database with master-slave replication, my gut feeling is that this should imply an implicit consensus, but I would really appreciate it if you could spare a thought on this topic.
It depends what data/resource is being protected by the lock. If that resource is in Postgres itself (in the same Postgres instance), then it seems likely that the lock in Postgres is sufficient for protecting that resource from concurrent access. (You may need to dial the isolation level up to serializable to be sure, depending on the type of data access you're doing.)
However, if the resource you're protecting is outside of Postgres, you're back in the situation of needing a fencing token. I don't think that a SELECT ... FOR UPDATE lock returns any value that could be used for fencing, but you can probably construct a token generator yourself: put an integer counter value in the row that you're locking with SELECT ... FOR UPDATE, and increment it every time you acquire the lock.
One complicating detail is that if a client's connection to the database is interrupted, the transaction held by that client will eventually time out and be aborted, so any counter increments made within that transaction will be discarded. That would allow multiple clients to potentially obtain the same counter value, which would violate the requirements on the fencing token. You'd need to find some way of incrementing the counter that is not rolled back on failure. Maybe a PostgreSQL sequence object would be suitable, since IIRC its increments don't get rolled back on abort.
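A toy model of that difference (illustrative names, not a real Postgres API): a counter stored in the locked row loses its increment when the transaction aborts, while a sequence does not, so only the sequence hands out distinct tokens across failures:

```python
class MiniPG:
    """Toy contrast between a row-stored counter (rolled back on abort)
    and a sequence (whose increments survive abort), to illustrate why a
    PostgreSQL sequence suits fencing-token generation. This is a sketch;
    the method names are made up and do not mirror any real driver API.
    """
    def __init__(self):
        self.row_counter = 0   # counter stored in the locked row
        self.sequence = 0      # like a sequence advanced by nextval()

    def acquire_with_row_counter(self, commit):
        candidate = self.row_counter + 1
        if commit:
            self.row_counter = candidate  # transaction commits
        # on abort, the increment is discarded: row_counter is unchanged
        return candidate

    def acquire_with_sequence(self, commit):
        # commit is irrelevant here: sequence increments are never rolled back
        self.sequence += 1
        return self.sequence

pg = MiniPG()
# client A's connection drops; its transaction aborts (commit=False)
a = pg.acquire_with_row_counter(commit=False)
b = pg.acquire_with_row_counter(commit=True)
assert a == b == 1          # duplicate token: fencing is broken
x = pg.acquire_with_sequence(commit=False)
y = pg.acquire_with_sequence(commit=True)
assert (x, y) == (1, 2)     # sequence still hands out distinct tokens
```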
If you're doing failover on Postgres node failure, you need to make sure that the new leader is one that was synchronously replicated from the old leader, so that the new leader has seen all of the increments performed by the old leader.
You may also find www.stateful.co interesting (I'm a developer there), which provides free hosted locks for distributed applications. I'm using it in a number of multi-server apps.
Hi Martin (Martin Kleppmann ), it seems to me that in order for the storage system which is protected by a distributed lock manager (DLM) to be able to participate in fencing, it has to be able to resolve concurrent requests on its own, because it must support the following operation (yes, the whole thing is a single operation, essentially compare-and-set (CAS)):
read the last-seen-fencing-number & compare it with the incoming-fencing-number, store the incoming-fencing-number if it is greater than the last-seen-fencing-number or reject the request with the incoming-fencing-number otherwise.
If a storage system can’t do such an operation, then the system can’t benefit from fencing. But if the system can do this, then such a system is able to resolve concurrent requests on its own because it has CAS, and the DLM becomes completely redundant for correctness purposes for such a system. Am I crazy, or am I correct? :)
It would be very cool if we could clarify this.
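The single operation described above can be sketched like this (a toy in-process model; the mutex stands in for whatever atomicity the storage system provides internally):

```python
import threading

class FencedStore:
    """Sketch of the storage-side check described above: one atomic
    read-compare-store on the fencing number (a CAS, in effect).
    A lock guards the critical section so check-and-set is one operation.
    """
    def __init__(self):
        self._mu = threading.Lock()
        self.last_seen = 0   # last-seen-fencing-number
        self.value = None

    def write(self, token, value):
        with self._mu:  # the whole thing is a single operation
            if token <= self.last_seen:
                return False  # reject the stale fencing number
            self.last_seen = token
            self.value = value
            return True

s = FencedStore()
assert s.write(33, "from client 1")
assert s.write(34, "from client 2")
assert not s.write(33, "late write from client 1")  # fenced off
```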
Can someone explain to me how distributed locks work at all? Let's assume that two separate people on the internet ask two separate load-balanced web servers to solve the same problem at the exact same millisecond. Well, both of the web servers ask for a lock at the exact same millisecond. With latency, how can you be certain the lock will be granted correctly to only one of the requesters?
If A asks for a lock and Redis sees there is no lock set, and B asks for a lock and Redis sees there is no lock set, wouldn't it try to give both of them the lock?

I guess this is where the delay comes in... force a delayed response to ensure you only give the lock to one of them?
Sorry if it sounds noob but I'd love to understand.
I have been solving concurrency problems by pushing all of my actions I want performed into a queue and having a single consumer go through them in order. It works pretty well, but obviously it is relatively slow.
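A sketch of that pattern with an in-process queue and one worker thread (a production system would use a durable, shared queue rather than `queue.Queue`):

```python
import queue
import threading

# Producers push actions onto a queue and a single consumer applies them
# in order, so no two actions ever run concurrently -- serialization
# without any locking, at the cost of throughput.
actions = queue.Queue()
results = []

def consumer():
    while True:
        action = actions.get()
        if action is None:        # sentinel: shut down the worker
            break
        results.append(action())  # actions run strictly one at a time

worker = threading.Thread(target=consumer)
worker.start()
for i in range(3):
    actions.put(lambda i=i: i * 10)  # enqueue work in submission order
actions.put(None)
worker.join()
assert results == [0, 10, 20]  # applied in the order they were enqueued
```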
Please respond to the comment http://disq.us/p/1w33vj3 on Is Redlock safe? (http://antirez.com/news/101).
Note Martin's diagrams discuss a scenario where the lock service (e.g. Redis) is separate from the storage service (e.g., MySQL). If locks and storage were both the same Redis cluster, we can have high-speed atomic operations.
I would refer to the storage service as a "work ledger"; it is more aptly a log of work unit deliveries, and it may not be ACID-compliant, or low latency.
Predictable tokens are nice because it becomes clear at ledger write time when the sequence is out of order. But what may be more important is whether the work unit is unique in the ledger. Imagine in Martin's example if a second unexpected delay occurred to token 34. Assuming both work units are equally valid, you may want the soonest delivery, not the latest or most recent. For a distributed lock service, anyway. Conflict resolution is another concern/service entirely, typically factoring in more advanced metrics, e.g. vector clocks.
Martin erroneously assumes that the lock service must be responsible for incrementing token ids. In fact, the application could also do this, with a simple map from Redis token uid to integer, or the storage system could AUTO_INCREMENT on INSERT and the application can map stored PK to Redis token on write commit. Those maps can be persisted to any shared storage system.
The magic in Martin's 2nd diagram is a simple unique constraint provided by the storage system. It solves the problem independent of the locking mechanism's reliability, and does so redundantly, assuming some waste work is okay.
Illustratively, if the work unit were "ask human to make coffee", Martin's solution would be discarding every good batch of coffee that is delivered late. In most software applications this is acceptable, but the locking mechanism to begin work is completely optional--it merely increases the probability of even distribution, without guaranteeing it. You could just as precisely run a coffee making lottery, only rewarding the most fast/efficient human delivery per cycle.
As the Redis "Distributed locks" documentation page currently states, the simple solution to achieve both minimal waste and high precision is to extend the TTL beyond all possible timing shenanigans. For example, 4 hours. Well managed systems should never experience network latency, GC pauses, etc. this long without health checks triggering service restarts. Whether your work unit is writing bank checks or dealing prescription medication--in systems where precision matters, we can trade speed for accuracy. If you want both high precision and high speeds, you'll need to get off of commodity hardware.
Note that high TTL does not have to mean low volume, if the workload is highly distributed and parallelized, and your error margin is low, and you release your locks once work transactions are committed. Identify your application cluster instances uniquely as well, and automatically release all locks held by them on startup (e.g. to free locks sooner on process restart).
By these criteria (waste, precision, speed), Martin's timing shenanigans eventually invalidate ALL distributed lock systems that are physically separated from their work ledger systems.
Sorry opportunistic fanboys, ZooKeeper, Kafka, etc. will not save you.
...from outdated clocks on locks between commodity boxes.
Great post! Nicely explains the fundamentals of distributed locking from a generic standpoint, besides the Redlock specific details. I stumbled upon this while researching on Curator's distributed locking recipe. Have you had any experience with that?
I have not used Curator, but a quick look at the docs https://curator.apache.org/... shows that the acquire() method does not return any value that could be used for fencing. That suggests to me that the Curator lock is not safe.
More generally with ZooKeeper you can use the zxid or the version number on the ZNode for fencing. Seems a bit problematic of Curator to not expose that value.
"More generally with ZooKeeper you can use the zxid or the version number on the ZNode for fencing"This is indeed similar to what I'm thinking about doing. In fact I'm thinking about having each client create a child znode under a well known znode, say /root/app_name with the name lock_hostName ,where hostName is the name of the physical node hosting the client. At any point in time, there could be just one child underneath /root/app. Then every time, a client successfully acquires a lock, it'd check for existence of children of /root/app. If there's none, the lock is legitimate (in the sense that it was NOT obtained by the current client due to a split brain kind of situation caused by the loss of connectivity of the previous lock owning client with ZK). Under such conditions, the current client can proceed with the work (that can be done by only ONE process at a time), and finally removes the child zNode before releasing the lock. If on the other hand, the client discovers a child under /root/app, it should throw up its arms and generate a FATAL alert, that should probably be resolved with manual intervention. Sorry for the long post. But, may be you can share some valuable feedback/thoughts on this approach.
Great post! May I translate this article into Korean?
Hi 지웅. I'd like to see your Korean version of this article. Did you translate this?
Sure, all articles on this site are licensed under creative commons: https://creativecommons.org... — you're welcome to translate it, and please include a link back to the original here.
Hi Martin Kleppmann, there is one point of confusion I'd like to hear your thoughts about. In the Redlock solution, a client is considered to have successfully acquired the lock if it gets locks from the majority of the Redis masters (that is, at least N/2+1 out of N Redis instances). I'm expecting a clearer definition of 'majority'. In the scenario where there are 5 Redis instances and 3 of them are down, it is impossible for any client to get more than half of the locks. So should we always check how many Redis instances are running before we acquire a lock? But if so, every client may get a different result at a different time.
The definition of "majority" assumes that you know how many redis instances you have (N), which would typically be part of your fixed configuration of the system. Given that you know N, a majority requires that you get votes from more than half of the instances. If 3 out of 5 instances are down, that does not mean N=2; it is still the case that N=5. In this case you will not be able to get the three required votes, and so you will not be able to acquire the lock. You don't need to explicitly check whether an instance is down, because a crashed instance is handled exactly the same way as an instance you cannot reach due to a network problem, or an instance that fails to vote for any other reason.
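That majority rule is just a comparison against the fixed N (the helper name here is illustrative):

```python
# N is fixed configuration; a lock is acquired only with votes from more
# than half of the N configured instances. Crashed or unreachable
# instances simply contribute no vote -- no liveness check is needed.
def majority_acquired(votes_granted, n_configured):
    return votes_granted >= n_configured // 2 + 1

assert majority_acquired(3, 5)      # 3 of 5 votes: lock acquired
assert not majority_acquired(2, 5)  # 3 instances down: only 2 votes possible
assert not majority_acquired(2, 4)  # exactly half is not a majority
```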
That makes sense to me. I probably missed something about the solution to it; is it introduced in your article? Thanks!
Why should the fencing token need to be monotonically increasing?
Wouldn't it just need to be different on each operation? E.g., couldn't a UUID be used? The key thing is checking whether the write token value is equal to the current token value. Less-thans and greater-thans wouldn't come into it.
The storage system that checks the fencing tokens doesn't know when a new client is granted the lock, it can only go by the token. If you used a UUID and a request turned up with a UUID that has never been seen before, how would the storage system know what to do? Perhaps the lock has been granted to a new client, so the new UUID should be accepted and the old UUID should be blocked. But perhaps the UUID is from a client that acquired the lock and then immediately paused before it was able to make any requests; in that case, a third client will have already acquired the lock, and the previously-unseen UUID should be blocked. Using monotonically increasing fencing tokens solves this conundrum.
Thanks, I see. Hmm... though now I wonder if locks/leases aren't needlessly complex for correctness. If I'm understanding this, the locking assumes the storage system (a) reliably checks a token value coming in with the write request, and (b) does the check as part of an atomic check-write operation.
With that, a storage system could ensure correctness without a lock service, like this: it maintains a current token value for a resource. A client reads this token value along with the resource. With the write operation the client sends along a two-part token: the first part is the original token value and the second part is a new unique value. It must be unique, so it is a UUID. The storage system, in its atomic check-write operation, checks that the first part of the token matches the resource's current token value. If it matches, the second part of the incoming token is written as the new current token value of the resource (along with the rest of the write operation). If it doesn't match, the write fails.
I think that's correctness where the only assumptions are the uniqueness of the tokens provided by the clients with the write operations and the atomicity of the storage service's check-write operation -- no locking/leasing assumed.
???? (It would be very awkward to put question marks at the end of all the sentences, but please take all this as a question!)
The use of a lock/lease system, then, would be to answer the question: is something already intending to modify the resource? (Which is a question of efficiency, of course.)
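A toy model of the scheme described above (illustrative names; a threading lock stands in for the storage system's atomic check-write):

```python
import threading
import uuid

class VersionedStore:
    """Sketch of the lock-free scheme: each resource carries a current
    token; a write supplies (expected_token, new_token) and the storage
    performs the comparison and the write as one atomic step.
    """
    def __init__(self):
        self._mu = threading.Lock()
        self.token = "initial"
        self.value = None

    def read(self):
        with self._mu:
            return self.token, self.value

    def write(self, expected_token, new_token, value):
        with self._mu:  # atomic check-write
            if expected_token != self.token:
                return False  # someone else wrote since our read
            self.token, self.value = new_token, value
            return True

s = VersionedStore()
t, _ = s.read()
assert s.write(t, str(uuid.uuid4()), "client A's update")
# client B read the same old token, so its write is now stale:
assert not s.write(t, str(uuid.uuid4()), "client B's stale update")
```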