We were unable to load Disqus. If you are a moderator please see our troubleshooting guide.
Hi Martin, now that Kafka 0.11 has support for transactions could we use it in some way shape or form for distributed locks? Perhaps using an event sourced pattern where an aggregate tracks who is locking it?
I have a particularly complex use-case where multiple resources are to be locked together for the purpose of a group-related operation, so I'm looking at many different options. Postgres seems like a safe bet, but if I can take advantage of Kafka transactions that would be great.
Thanks for the article!
Not knowing all the details of your requirements it's hard to give an unequivocal answer, but my guess is that Kafka transactions are irrelevant here. The main feature of Kafka transactions is to allow you to atomically publish messages to several topic-partitions. I don't see how that would help you implement a distributed lock.
I am not sure Kafka is really the right tool for this particular purpose, although if you really want to use it, then I would suggest routing all lock requests through a single topic-partition. The monotonically increasing offset that Kafka assigns to messages in a partition could then be used for fencing.
Thanks Martin! The last part you mentioned is along the lines of what I was thinking: Aggregates would each map to a single topic-partition. A client may then attempt to lock (for correctness) one or more aggregates in a transaction. The assumption is that having "exactly-once" guarantees would help with updating these aggregates together, for example updating their "lockedBy", "lockedAt" properties.
"Instead, please use a proper consensus system such as ZooKeeper, probably."
I literally skimmed the entire article for this part. :-)
That said, it's a good writeup, and very informative. Thanks for sharing!
Today, I learn and think about the "redlock:. I am very agree you!
You may also find www.stateful.co interesting (I'm a developer), which provided free hosted locks for distributed applications. I'm using it in a number of multi-server apps.
Hi Martin (Martin Kleppmann ), it seems to me that in order for the storage system which is protected by a distributed lock manager (DLM) to be able to participate in fencing, it has to be able to resolve concurrent requests on its own, because it must support the following operation (yes, the whole thing is a single operation, essentially compare-and-set (CAS)):
read the last-seen-fencing-number & compare it with the incoming-fencing-number, store the incoming-fencing-number if it is greater than the last-seen-fencing-number or reject the request with the incoming-fencing-number otherwise.
If a storage system can’t do such an operation, then the system can’t benefit from fencing. But if the system can do this, then such a system is able to resolve concurrent requests on its own because it has CAS, and the DLM becomes completely redundant for correctness purposes for such a system. Am I crazy, or am I correct? :)
It would be very cool if we could clarify this.
Can someone explain to me how distributed locks work at all? Let's assume that 2 seperate people on the internet ask two seperate loadbalanced webservers to solve the same problem at the exact same milisecond. Well, both of the web servers ask for a lock at the exact same milisecond. With latency, how can you be certain the lock will be distributed correctly to only one of the resources?
if A asks for a lock and redis sees there is no lock setand B asks for a lock and redis sees there is no lock setwouldn't it try to give both of them the lock?
I guess this is where the delay comes in..force a delayed response to ensure you only give the lock to one of them?
Sorry if it sounds noob but I'd love to understand.
I have been solving concurrency problems by pushing all of my actions I want performed into a queue and having a single consumer go through them in order. It works pretty well, but obviously it is relatively slow.
Please respond to the comment http://disq.us/p/1w33vj3 on Is Redlock safe? (http://antirez.com/news/101),
Note Martin's diagrams discuss a scenario where the lock service (e.g. Redis) is separate from the storage service (e.g., MySQL). If locks and storage were both the same Redis cluster, we can have high-speed atomic operations.
I would refer to the storage service as a "work ledger"; it is more aptly a log of work unit deliveries, and it may not be ACID-compliant, or low latency.
Predictable tokens are nice because it becomes clear at ledger write time when the sequence is out of order. But what may be more important is whether the work unit is unique in the ledger. Imagine in Martin's example if a second unexpected delay occurred to token 34. Assuming both work units are equally valid, you may want the soonest delivery, not the latest or most recent. For a distributed lock service, anyway. Conflict resolution is another concern/service entirely--typically factoring in more advanced metrics i.e. vector clocks.
Martin erroneously assumes that the lock service must be responsible for incrementing token ids. In fact, the application could also do this, with a simple map from Redis token uid to integer, or the storage system could AUTO_INCREMENT on INSERT and the application can map stored PK to Redis token on write commit. Those maps can be persisted to any shared storage system.
The magic in Martin's 2nd diagram is a simple unique constraint provided by the storage system. It solves the problem independent of the locking mechanism's reliability, and does so redundantly, assuming some waste work is okay.
Illustratively, if the work unit were "ask human to make coffee", Martin's solution would be discarding every good batch of coffee that is delivered late. In most software applications this is acceptable, but the locking mechanism to begin work is completely optional--it merely increases the probability of even distribution, without guaranteeing it. You could just as precisely run a coffee making lottery, only rewarding the most fast/efficient human delivery per cycle.
As the Redis "Distributed locks" documentation page currently states, the simple solution to achieve both minimal waste and high precision is to extend the TTL beyond all possible timing shenanigans. For example, 4 hours. Well managed systems should never experience network latency, GC pauses, etc. this long without health checks triggering service restarts. Whether your work unit is writing bank checks or dealing prescription medication--in systems where precision matters, we can trade speed for accuracy. If you want both high precision and high speeds, you'll need to get off of commodity hardware.
Note that high TTL does not have to mean low volume, if the workload is highly distributed and parallelized, and your error margin is low, and you release your locks once work transactions are committed. Identify your application cluster instances uniquely as well, and automatically release all locks held by them on startup (e.g. to free locks sooner on process restart).
By these criteria (waste, precision, speed), Martin's timing shenanigans eventually invalidate ALL distributed lock systems that are physically separated from their work ledger systems.
Sorry opportunistic fanboys, ZooKeeper, Kafka, etc. will not save you.
...from outdated clocks on locks between commodity box.
Great post! Nicely explains the fundamentals of distributed locking from a generic standpoint, besides the Redlock specific details. I stumbled upon this while researching on Curator's distributed locking recipe. Have you had any experience with that?
I have not used Curator, but a quick look at the docs https://curator.apache.org/... shows that the acquire() method does not return any value that could be used for fencing. That suggests to me that the Curator lock is not safe.
More generally with ZooKeeper you can use the zxid or the version number on the ZNode for fencing. Seems a bit problematic of Curator to not expose that value.
"More generally with ZooKeeper you can use the zxid or the version number on the ZNode for fencing"This is indeed similar to what I'm thinking about doing. In fact I'm thinking about having each client create a child znode under a well known znode, say /root/app_name with the name lock_hostName ,where hostName is the name of the physical node hosting the client. At any point in time, there could be just one child underneath /root/app. Then every time, a client successfully acquires a lock, it'd check for existence of children of /root/app. If there's none, the lock is legitimate (in the sense that it was NOT obtained by the current client due to a split brain kind of situation caused by the loss of connectivity of the previous lock owning client with ZK). Under such conditions, the current client can proceed with the work (that can be done by only ONE process at a time), and finally removes the child zNode before releasing the lock. If on the other hand, the client discovers a child under /root/app, it should throw up its arms and generate a FATAL alert, that should probably be resolved with manual intervention. Sorry for the long post. But, may be you can share some valuable feedback/thoughts on this approach.
Redis works but should be approaches with caution for anything involving more than one instance and even more than one concurrent writer. Simply look at his 5 year odyssey just to implement existing concurrency patterns and even then with limited success. I think if he was honest he would put himself in the category of "creator of stick-figures in crayon" compared to an animator or graphics artist, to use an analogy. I'm guessing it was trial and error since actually understanding the algorithms was a bridge too far. It's just that stick-figures in crayon hit 90% of the initial use-cases, and to be early is sometimes better than being good.
great post! may i translate this article to korean?
Hi 지웅. I'd like to see your Korean version of this article. Did you translate this?
Sure, all articles on this site are licensed under creative commons: https://creativecommons.org... — you're welcome to translate it, and please include a link back to the original here.
Hi Martin Kleppmann , there is one confusion I'd like to hear your thoughts about. In the red lock solution, a client is considered to successfully acquire the lock should get locks from the majority of the redis masters(that is at least N/2+1 out of N redis instances). I'm expecting a more clear definition of the 'majority'. As in the scenario where there are 5 redis instances and 3 of them are down, then for all the clients, it is impossible that anyone would get the more than half of the locks. So should we always check how are running redis instances before we acquire a lock? But if so, every client may get different result at different time.
The definition of "majority" assumes that you know how many redis instances you have (N), which would typically be part of your fixed configuration of the system. Given that you know N, a majority requires that you get votes from more than half of the instances. If 3 out of 5 instances are down, that does not mean N=2; it is still the case that N=5. In this case you will not be able to get the three required votes, and so you will not be able to acquire the lock. You don't need to explicitly check whether an instance is down, because a crashed instance is handled exactly the same way as an instance you cannot reach due to a network problem, or an instance that fails to vote for any other reason.
That makes sense to me. I probably missed something about the solution to it, is it introduced in your article? Thx!
Why should the fencing token need to be monotonically increasing?
Wouldn't it just need to be different on each operation? E.g., couldn't a UUID be used? The key thing is checking whether the write token value is equal to the current token value. Less-thans and greater-thans wouldn't come into it.
The storage system that checks the fencing tokens doesn't know when a new client is granted the lock, it can only go by the token. If you used a UUID and a request turned up with a UUID that has never been seen before, how would the storage system know what to do? Perhaps the lock has been granted to a new client, so the new UUID should be accepted and the old UUID should be blocked. But perhaps the UUID is from a client that acquired the lock and then immediately paused before it was able to make any requests; in that case, a third client will have already acquired the lock, and the previously-unseen UUID should be blocked. Using monotonically increasing fencing tokens solves this conundrum.
Thanks, I see. hm... though now I wonder if locks/leases aren't needlessly complex for correctness. If I'm understanding this, the locking assumes the storage system provides (a) a reliable check of a token value coming in with the write request and (b) does the check as part of an atomic check-write operation.
With that, a storage system could ensure correctness without a lock service like this: it maintains a current token value for a resource. A client reads this token value along with the resource. With the write operation the client sends along a two part token: the first part is the original token value and second part is a new unique value. It must be unique so it is a UUID. The storage system, in its check-write atomic operation checks that the first part of the token matches the resource's current token value. If it matches the the second part of the incoming token is written as the new current token value of the resource (along with the rest of the write operation). If it doesn't match, the write fails.
I think that's correctness where the only assumptions are the uniqueness of the tokens provided by the clients with the write operations and the atomicity of the storage service's check-write operation -- no locking/leasing assumed.
???? (It would be very awkward to put question marks at the end of all the sentences, but please take all this as a question!)
The use of a lock/lease system, then, would be to answer the question: is something already intending to modify the resource? (Which is a question of efficiency, of course.)
Great post! But I think a clarification has to be made regarding fencing tokens: What happens in the (unlikely) case that client 2 also suffers a STW pause and they write with increasing successive tokens?
The storage system simply maintains the ‘ratchet’ that the token can only stay the same or increase, but not decrease. Thus, if client 2 pauses and client 3 acquires the lock, client 2 will have a lesser token. If client 3 has already made a request to the storage service, client 1 and 2 will both be blocked.
what I mean is:client 1 acquires lockclient 1 stopsclient 1 lock expiresclient 2 acquires lockclient 2 stopsclient 1 resumes and writes to storageclient 2 resumes and writes to storage
This is more unlikely than the scenario you presented, but still possible, and breaks the desired "correctness" since both writes are accepted.
This is a really good resource if someone is learning for distributed locking. Here's a good example on how to use it in a distributed cache.http://blogs.alachisoft.com...
Note for the readers: that there is an error in the way the Redlock algorithm is used in the blog post: the final step after the majority is acquired, is to check if the total time elapsed is already over the lock TTL, and in such a case the client does not consider the lock as valid. This makes Redlock immune from client <-> lock-server delays in the messages, and makes every other delay *after* the lock validity is tested as any other GC pause during the processing of the locked resource. This is also equivalent to what happens, when using a remote lock server, if the "OK, you have the lock" reply from the server remains in the kernel buffers since the socket pauses before reading it. So where in this blog post its assumed that network delays or GC pauses during the lock acquisition stage are a problem, there is an error.
This is correct, I had overlooked that additional clock check after messages are received. However, I believe that the additional check does not substantially alter the properties of the algorithm:
- Large network delay between the application and the shared resource (the thing that is protected by the lock) can still cause the resource to receive a message from an application process that no longer holds the lock, so fencing is still required.
- A GC pause between the final clock check and the resource access will not be caught by the clock check. As I point out in the article: "Remember that GC can pause a running thread at any point, including the point that is maximally inconvenient for you".
- All the dependencies on accuracy of clock measurement still hold.
Hello Martin, thanks for your reply. Network delays between the app and the shared resource, and a GC pause *after* the check, but before doing the actual work, are all conceptually exactly the same as the "point 1" of your argument, that is, GC pauses (or other pauses) make the algo require an incremental token. So, regarding the safety of the algorithm itself, the only remaining thing would be the dependency on clock drifts, that can be argued depending on point of view. So I'm sorry to have to say that IMHO the current version of the article, by showing the wrong implementation of the algorithm, and not citing the equivalence of GC pauses processing the shared resource, with GC pauses immediately after the token is acquired, does not provide a fair picture.
"that that" in paragraph
"However, Redlock is not like this. Its safety depends on a lot of timing assumptions: it assumes that all Redis nodes hold keys for approximately the right length of time before expiring; that that the network delay is small compared to the expiry duration; and that process pauses are much shorter than the expiry duration."
Great article though, I'm just being an editor :P
Great write up. Especially in terms of distilling the theory into examples. Perhaps it would be worth reiterating (in the paragraph before the conclusion) that paxos, raft et al are still safe even if the system degenerates into the async model, but progress is no longer guaranteed (i.e. Likeness)
Autocorrect killed me :) likeness = liveness