
Arun Chouhan • 1 year ago

After reading the article, one conclusion can be drawn: it is specific to Redlock as implemented in the Redisson library. But I would like to implement a lock from scratch using incrementing atomic counters in Redis, without setting a timeout.

The implementation would look like:
getLock(key): increment the key by 1; if the value I get back is 1 then the lock is acquired, otherwise someone has already taken the lock.

releaseLock(key): delete the key.
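
A rough sketch of this in Python with redis-py (purely illustrative; assumes a single Redis instance) could be:

```python
# Illustrative sketch of the INCR-based lock described above (redis-py).
import redis

r = redis.Redis()  # assumes a single local Redis instance

def get_lock(key):
    # INCR is atomic, so only the caller that moves the counter from 0 to 1
    # acquires the lock; any later caller sees a value greater than 1.
    return r.incr(key) == 1

def release_lock(key):
    # Deleting the key resets the counter so the next INCR returns 1 again.
    r.delete(key)
```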

Is this implementation safe enough for correctness?

Valentin Kovalenko • 1 year ago

Not having a lock timeout means that any process that grabs the lock and then becomes unavailable makes the whole LM (lock manager) unusable. So if you do not have a timeout, then there is no sense in a DLM (distributed LM) at all (though there is no sense in a DLM anyway...). Thus, if you simply need an LM, then using PostgreSQL advisory session locks makes more sense than doing it with Redis, because PostgreSQL session locks are at least auto-unlocked if the holding session disconnects, so an unavailable process holding the lock will not render a PostgreSQL-based LM unusable.
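
For illustration, a session-level advisory lock in Python with psycopg2 might be sketched like this (connection string and lock id are made up):

```python
# Illustrative sketch: PostgreSQL session-level advisory lock via psycopg2.
import psycopg2

conn = psycopg2.connect("dbname=app")  # hypothetical connection string
cur = conn.cursor()

cur.execute("SELECT pg_advisory_lock(42)")   # blocks until lock 42 is acquired
try:
    pass  # ... mutually exclusive work goes here ...
finally:
    cur.execute("SELECT pg_advisory_unlock(42)")
# If the session dies instead of unlocking, PostgreSQL releases the advisory
# lock automatically when the connection goes away.
```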

climbsocial • 3 years ago

Hi Martin, now that Kafka 0.11 has support for transactions, could we use it in some way, shape or form for distributed locks? Perhaps using an event-sourced pattern where an aggregate tracks who is locking it?

I have a particularly complex use-case where multiple resources are to be locked together for the purpose of a group-related operation, so I'm looking at many different options. Postgres seems like a safe bet, but if I can take advantage of Kafka transactions that would be great.

Thanks for the article!

Martin Kleppmann • 3 years ago

Not knowing all the details of your requirements it's hard to give an unequivocal answer, but my guess is that Kafka transactions are irrelevant here. The main feature of Kafka transactions is to allow you to atomically publish messages to several topic-partitions. I don't see how that would help you implement a distributed lock.

I am not sure Kafka is really the right tool for this particular purpose, although if you really want to use it, then I would suggest routing all lock requests through a single topic-partition. The monotonically increasing offset that Kafka assigns to messages in a partition could then be used for fencing.
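
A hedged sketch of that idea in Python with kafka-python (topic name and setup are made up) might be:

```python
# Illustrative sketch: all lock requests go through one topic-partition, and the
# offset Kafka assigns to each message serves as the fencing token (kafka-python).
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Always write to partition 0 so that offsets are totally ordered.
future = producer.send("lock-requests", value=b"lock:my-resource", partition=0)
metadata = future.get(timeout=10)

fencing_token = metadata.offset  # monotonically increasing within the partition
```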

climbsocial • 3 years ago

Thanks Martin! The last part you mentioned is along the lines of what I was thinking: Aggregates would each map to a single topic-partition. A client may then attempt to lock (for correctness) one or more aggregates in a transaction. The assumption is that having "exactly-once" guarantees would help with updating these aggregates together, for example updating their "lockedBy", "lockedAt" properties.

idelvall • 4 years ago

Great post! But I think a clarification needs to be made regarding fencing tokens: what happens in the (unlikely) case that client 2 also suffers a stop-the-world pause, and both clients then write with successive, increasing tokens?

Martin Kleppmann • 4 years ago

The storage system simply maintains the 'ratchet': the token can only stay the same or increase, never decrease. Thus, if client 2 pauses and client 3 acquires the lock, client 2 will have a lower token. If client 3 has already made a request to the storage service, clients 1 and 2 will both be blocked.

idelvall • 4 years ago

What I mean is:
client 1 acquires lock
client 1 stops
client 1 lock expires
client 2 acquires lock
client 2 stops
client 1 resumes and writes to storage
client 2 resumes and writes to storage

This is more unlikely than the scenario you presented, but still possible, and breaks the desired "correctness" since both writes are accepted.

Martin Kleppmann • 4 years ago

Oh, I see what you mean now. Yes, you have a good point — that scenario does look risky. To reason about it properly, I think we would need to make some assumptions about the semantics of the write that occurs under the lock; I think it would probably turn out to be safe in some cases, but unsafe in others. I will think about it some more.

In consensus algorithms that use epochs/ballot numbers/round numbers (which have a similar function to the fencing token), the algorithm works because the type of write is constrained. Thus, Paxos for example can maintain the invariant that if one node decides x, no other node will decide a value other than x, even in the presence of arbitrary pauses. If unconstrained writes were allowed, the safety property could be violated.

Perhaps it would be useful to regard a storage system with fencing token support as participating in an approximation of a consensus algorithm, but further protocol constraints would be required to make it safe in all circumstances?

idelvall • 4 years ago

After some thinking, this is how I would do it:

In the lock manager:
- If, since the last released lock, no later lock has expired, then the next returned token is an "ordinary token" (incrementing the previous one)
- Otherwise, the next returned token is a "paired token" carrying major/minor information, where major is the current token number and minor is the number of the first token not yet released at this time

In lock-aware resources:
- Keep a record of the highest accepted token
- If the current token is ordinary, then behave as usual (rejecting it when it's not greater than the highest)
- If the current token is paired (granted after some other token's expiration), then accept it only if its minor number is greater than the highest known

This would be consistent with my previous example:
- client 1 acquires lock (token: "1")
- client 1 stops
- client 1's lock expires
- client 2 acquires lock (token "2:1", meaning "lock 2 given after 1 expired")
- client 2 stops
- client 1 resumes and writes to storage
- storage accepts the token and sets the highest known to "1"
- client 2 resumes and writes to storage
- storage rejects token "2:1" since "1" is not greater than the highest ("1")
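
A toy sketch of the resource-side check (hypothetical Python; tokens are either plain integers or (major, minor) pairs, all names made up):

```python
# Toy sketch of the lock-aware resource check described above (illustrative only).
highest = 0  # highest token accepted so far

def accept(token):
    """token is either an int (ordinary) or a (major, minor) tuple (paired)."""
    global highest
    if isinstance(token, tuple):
        major, minor = token
        # Paired token: granted after an earlier lock expired. Accept only if
        # the first unreleased token at grant time has already been superseded.
        if minor > highest:
            highest = major
            return True
        return False
    # Ordinary token: the usual monotonic check.
    if token > highest:
        highest = token
        return True
    return False
```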

What do you think?

cquliaoli • 1 year ago

I think your method does not work:
- client 1 acquires lock (token: "1")
- client 1 stops
- client 1's lock expires
- client 2 acquires lock (token "2:1", meaning "lock 2 given after 1 expired")
- client 2 does not write to storage
- client 2 releases the lock
- client 3 acquires lock (token: "3")
- client 3 stops
- client 3's lock expires
- client 1 resumes and writes to storage
- storage accepts the token and sets the highest known to "1"
- client 3 resumes and writes to storage
- storage accepts the token

Valentin Kovalenko • 2 years ago

This approach means that the storage would reject all requests after the expiration of the lock (token: 1), except for client 1's requests, until client 1 explicitly goes and releases the expired lock (token: 1). And this would defeat the point of lock expiration: even though a lock can expire, the whole system (lock manager + storage) must still wait for it to be released if the owner of the lock has made at least one request to the storage.

idelvall • 2 years ago

No, this approach doesn't mean that.

First, let's clarify that this idea was proposed in the context of fencing tokens (locking without making any timing assumptions, accepting a token based only on its value and the value of the previously accepted one).

Once the first lock expires, the locking system gives the lock to client 2 (token 2), but whichever write operation arrives first succeeds. If 2 comes first, 1 is disabled automatically by the ordinary case (the monotonically increasing check); but if 1 comes first, 2 is disabled by the new "paired" case.

Martin Kleppmann • 4 years ago

Interesting idea. Seems plausible at first glance, but it's the kind of subtle protocol that would benefit from a formal proof of correctness. In these distributed systems topics it's terribly easy to accidentally overlook some edge-case.

Jeff Jeff • 4 years ago

This actually looks a lot like the algorithm proposed in the paper "On Optimistic Methods for Concurrency Control" by Kung and Robinson, 1981 (http://www.eecs.harvard.edu...)

I believe that this paper addresses the exact issues that idelvall mentioned and also includes a formal proof. Additionally, in the case proposed, as long as client 1 and client 2 have no conflicts in what they are writing, both would still be permitted; however, if there were a conflict, client 2 would be rejected prior to starting its write. I'd be interested in hearing your thoughts on this though.

Adrian Petre • 2 years ago

Jeff Jeff: hi, the link is broken.

idelvall • 4 years ago

Agreed, it was just an idea. I'll take a look at the Chubby paper and see how they handle this.

xingdl2007 • 1 year ago

Chubby solved this problem by introducing sequencers.

Quote from 2.4:
At any time, a lock holder may request a sequencer, an opaque byte-string that describes the state of the lock immediately after acquisition. It contains the name of the lock, the mode in which it was acquired (exclusive or shared), and the lock generation number. The client passes the sequencer to servers (such as file servers) if it expects the operation to be protected by the lock. The recipient server is expected to test whether the sequencer is still valid and has the appropriate mode; if not, it should reject the request. The validity of a sequencer can be checked against the server's Chubby cache or, if the server does not wish to maintain a session with Chubby, against the most recent sequencer that the server has observed.

In my opinion, it is only safe to check against the Chubby cache; otherwise it suffers from the same problem.
The Chubby paper also mentions another, imperfect solution, lock-delay: if a lock becomes free because the holder has failed or become inaccessible, the lock server will prevent other clients from claiming the lock for a lock-delay period (1 minute).

antirez • 5 years ago

Note for readers: there is an error in the way the Redlock algorithm is described in the blog post: the final step, after the majority is acquired, is to check whether the total time elapsed is already over the lock TTL, and in that case the client does not consider the lock valid. This makes Redlock immune to client <-> lock-server message delays, and makes every other delay *after* the lock validity is tested equivalent to any other GC pause during the processing of the locked resource. This is also equivalent to what happens, when using a remote lock server, if the "OK, you have the lock" reply from the server remains in the kernel buffers because the socket pauses before reading it. So where this blog post assumes that network delays or GC pauses during the lock acquisition stage are a problem, there is an error.
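
For readers, a rough sketch of that final validity check (hypothetical Python; try_lock and unlock are stand-ins for the per-instance SET NX PX and DEL operations):

```python
# Rough sketch of the Redlock validity check antirez describes (illustrative only).
import time

def acquire_redlock(instances, resource, value, ttl_ms):
    start = time.monotonic()
    # try_lock is a hypothetical helper: SET resource value NX PX ttl_ms on one instance.
    votes = sum(1 for inst in instances if try_lock(inst, resource, value, ttl_ms))
    elapsed_ms = (time.monotonic() - start) * 1000
    drift_ms = ttl_ms * 0.01 + 2     # clock-drift allowance from the Redlock description

    validity_ms = ttl_ms - elapsed_ms - drift_ms
    if votes >= len(instances) // 2 + 1 and validity_ms > 0:
        return validity_ms           # the lock is only trusted for this remaining window
    # Acquisition took too long or no majority: undo and fail.
    for inst in instances:
        unlock(inst, resource, value)  # hypothetical helper: DEL if value matches
    return None
```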

Martin Kleppmann • 5 years ago

This is correct, I had overlooked that additional clock check after messages are received. However, I believe that the additional check does not substantially alter the properties of the algorithm:

- Large network delay between the application and the shared resource (the thing that is protected by the lock) can still cause the resource to receive a message from an application process that no longer holds the lock, so fencing is still required.

- A GC pause between the final clock check and the resource access will not be caught by the clock check. As I point out in the article: "Remember that GC can pause a running thread at any point, including the point that is maximally inconvenient for you".

- All the dependencies on accuracy of clock measurement still hold.

antirez • 5 years ago

Hello Martin, thanks for your reply. Network delays between the app and the shared resource, and a GC pause *after* the check but before doing the actual work, are all conceptually exactly the same as "point 1" of your argument; that is, GC pauses (or other pauses) make the algorithm require an incrementing token. So, regarding the safety of the algorithm itself, the only remaining issue would be the dependency on clock drift, which can be argued either way depending on your point of view. So I'm sorry to have to say that, IMHO, the current version of the article, by showing the wrong implementation of the algorithm and not citing the equivalence of GC pauses while processing the shared resource with GC pauses immediately after the token is acquired, does not provide a fair picture.

MarutSingh • 4 years ago

The bottom line is that, for an application programmer like me, this implementation looks doubtful enough not to use in production systems. If it does not work perfectly, it can introduce bugs that will be impossible to fix.

Douglas Muth • 5 years ago

"Instead, please use a proper consensus system such as ZooKeeper, probably."

I literally skimmed the entire article for this part. :-)

That said, it's a good writeup, and very informative. Thanks for sharing!

李涛 • 3 years ago

Today I learned and thought about Redlock. I very much agree with you!

Rafael Werlang • 1 month ago

It's an old post, nonetheless it is particularly relevant today with the wide adoption of blockchain systems.
As stated, due to the fundamental asynchrony in communication among peers, a cascade of problems arises once we introduce lock leases that expire.
However, if we remove the expiry of lock leases, we run into the problem of a peer that successfully acquired a lock (by majority vote) never releasing the lease and holding onto the lock forever (whether through malicious intent, network problems, etc.).
On the other hand, we may solve the problem of a lock being held indefinitely by having the majority vote for a forced release of the lock.
If the majority has already voted for the forced release of a lock, and the peer that held the lock comes back online and tries to write or release the lock after the forced release, the operation will simply be rejected by the majority, and the node that used to own the lock will revert any operations made in the context of its acquired lock.
In that sense, the lock manager itself is distributed among the peers, which by majority vote perform three distinct operations: grant lock, commit + lock release, revert + forced release.

Guest • 4 months ago

Hello.

But wouldn't this sort of fencing token scheme imply that the storage has a critical section where it can compare the current token with the last one it received from the clients? This critical section alone seems enough to avoid conflicts. Also, with this approach, the storage could simply store object versions, and reject operations that didn't specify the most recent version.

What I'm saying is, it sure looks like if it's possible to implement fenced updates in the storage at all, then you don't need a separate lock service anymore. What am I missing?

Martin Kleppmann • 4 months ago

Well observed. I didn't want to go into too much detail in this already long post. There are situations in which a fencing token is still useful. For example, if there are several replicas of the storage service, then each replica can independently check the freshness of the fencing token without coordinating with the other replicas. If a quorum (e.g. 2 out of 3) of replicas accept a fencing token as valid, then a write is reported as successful to the client.

This situation is actually very similar to the second phase of Paxos or Raft (if you replace "fencing token" with "term number"), where a leader tries to replicate a value to a quorum of nodes. The first phase of Paxos/Raft is the leader election, which corresponds closely to acquiring a distributed lock. Consensus algorithms ensure that there is at most one leader for a given term number, and that term numbers monotonically increase over time. Being the leader in a particular term number corresponds to acquiring the lock with a particular fencing token.

Valentin Kovalenko • 4 months ago

Thanks, Martin! Makes sense, but it also means that while trying to come up with a distributed lock management algorithm, we ended up with nothing but another consensus algorithm :)

Valentin Kovalenko • 4 months ago

Yes, this is exactly what I expressed in the comments here about 2 years ago. Well, I am glad I am not the only one who sees it this way :)

DN • 1 year ago

Hi Martin Kleppmann, thanks for an excellent primer on the pitfalls of distributed locking.
As an application developer who was thus far unaware of such subtleties of distributed locking: I have often relied on Postgres row locks (which are really leases when you take the idle client session timeout into account) to implement mutual exclusion, mostly for "efficiency" use cases but occasionally also for "correctness". I am not so sure anymore that this was such a good idea.
I wonder if you have insight into the implementation of Postgres row locks and whether they can serve as a good (enough) drop-in replacement for some of your recommended systems (e.g. ZooKeeper) for the "correctness" use case?
I have mainly used SELECT ... FOR UPDATE for mutual exclusion, in addition to sometimes using SELECT ... FOR UPDATE SKIP LOCKED for implementing work-stealing schedulers (a sketch of that pattern is below). I am assuming the same semantics extend to Postgres advisory locks too.
At first glance it seems like fencing should be built in, considering all Postgres locks are issued in transactional scope, although I imagine a lot of that depends on the configured transaction isolation level.
Additionally, considering that it's a relational database with master-slave replication, my gut feeling is that this should imply an implicit consensus, but I would really appreciate it if you could spare a thought on this topic.
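
For concreteness, the SKIP LOCKED pattern I mean is roughly the following sketch (psycopg2; the jobs table and columns are made up):

```python
# Illustrative sketch of a SKIP LOCKED work queue (psycopg2; schema is hypothetical).
import psycopg2

conn = psycopg2.connect("dbname=app")

def claim_next_job():
    with conn:  # one transaction per claimed job; commits on success
        with conn.cursor() as cur:
            cur.execute("""
                SELECT id, payload FROM jobs
                WHERE done = false
                ORDER BY id
                FOR UPDATE SKIP LOCKED
                LIMIT 1
            """)
            row = cur.fetchone()
            if row is None:
                return None                    # nothing unclaimed right now
            job_id, payload = row
            # ... process payload while the row lock is held ...
            cur.execute("UPDATE jobs SET done = true WHERE id = %s", (job_id,))
            return job_id
```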

Martin Kleppmann • 1 year ago

It depends what data/resource is being protected by the lock. If that resource is in Postgres itself (in the same Postgres instance), then it seems likely that the lock in Postgres is sufficient for protecting that resource from concurrent access. (You may need to dial the isolation level up to serializable to be sure, depending on the type of data access you're doing.)

However, if the resource you're protecting is outside of Postgres, you're back in the situation of needing a fencing token. I don't think that a SELECT ... FOR UPDATE lock returns any value that could be used for fencing, but you can probably construct a token generator yourself: put an integer counter value in the row that you're locking with SELECT ... FOR UPDATE, and increment it every time you acquire the lock.

One complicating detail is that if a client's connection to the database is interrupted, the transaction held by that client will eventually time out and be aborted, so any counter increments made within that transaction will be discarded. That would allow multiple clients to potentially obtain the same counter value, which would violate the requirements on the fencing token. You'd need to find some way of incrementing the counter that is not rolled back on failure. Maybe a PostgreSQL sequence object would be suitable, since IIRC its increments don't get rolled back on abort.
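
A sketch of that combination (hypothetical table and sequence names, psycopg2) might look like:

```python
# Illustrative sketch: row lock for mutual exclusion, sequence for the fencing token.
import psycopg2

conn = psycopg2.connect("dbname=app")

def acquire_lock_with_token(resource_id):
    cur = conn.cursor()
    # Take the lock; it is held until this transaction commits or aborts.
    cur.execute("SELECT id FROM locks WHERE id = %s FOR UPDATE", (resource_id,))
    # nextval() is not rolled back on abort, so no two clients can observe
    # the same token value.
    cur.execute("SELECT nextval('fencing_token_seq')")
    (token,) = cur.fetchone()
    return token  # send this with every write to the protected resource
```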

If you're doing failover on Postgres node failure, you need to make sure that the new leader is one that was synchronously replicated from the old leader, so that the new leader has seen all of the increments performed by the old leader.

Yegor Bugayenko • 1 year ago

You may also find www.stateful.co interesting (I'm a developer), which provides free hosted locks for distributed applications. I'm using it in a number of multi-server apps.

Valentin Kovalenko • 2 years ago

Hi Martin (Martin Kleppmann ), it seems to me that in order for the storage system which is protected by a distributed lock manager (DLM) to be able to participate in fencing, it has to be able to resolve concurrent requests on its own, because it must support the following operation (yes, the whole thing is a single operation, essentially compare-and-set (CAS)):

read the last-seen fencing number and compare it with the incoming fencing number; store the incoming fencing number if it is greater than the last-seen one, or otherwise reject the request carrying the incoming fencing number.
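
(As an illustration only, the operation I mean is roughly the following, executed atomically on the storage side; apply_write is a hypothetical helper:)

```python
# Illustrative sketch of the atomic check-and-set on the storage side.
last_seen_fencing_number = 0

def write(incoming_fencing_number, data):
    global last_seen_fencing_number
    # The whole body must run atomically (single-threaded loop, mutex, ...).
    if incoming_fencing_number > last_seen_fencing_number:
        last_seen_fencing_number = incoming_fencing_number
        apply_write(data)   # hypothetical helper that performs the actual write
        return True
    return False            # stale token: reject the request
```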

If a storage system can’t do such an operation, then the system can’t benefit from fencing. But if the system can do this, then such a system is able to resolve concurrent requests on its own because it has CAS, and the DLM becomes completely redundant for correctness purposes for such a system. Am I crazy, or am I correct? :)

It would be very cool if we could clarify this.

Ben • 2 years ago

Can someone explain to me how distributed locks work at all? Let's assume that two separate people on the internet ask two separate load-balanced web servers to solve the same problem at the exact same millisecond. Well, both of the web servers ask for a lock at the exact same millisecond. With latency, how can you be certain the lock will be given correctly to only one of them?

If A asks for a lock and Redis sees there is no lock set,
and B asks for a lock and Redis sees there is no lock set,
wouldn't it try to give both of them the lock?

I guess this is where the delay comes in... force a delayed response to ensure you only give the lock to one of them?

Sorry if it sounds noob but I'd love to understand.

I have been solving concurrency problems by pushing all of the actions I want performed into a queue and having a single consumer go through them in order. It works pretty well, but obviously it is relatively slow.

Adrian Petre • 2 years ago

Please respond to the comment http://disq.us/p/1w33vj3 on "Is Redlock safe?" (http://antirez.com/news/101).

Thank you!

Mike S. • 3 years ago

Note that Martin's diagrams discuss a scenario where the lock service (e.g. Redis) is separate from the storage service (e.g. MySQL). If locks and storage were both on the same Redis cluster, we could have high-speed atomic operations.

I would refer to the storage service as a "work ledger"; it is more aptly a log of work unit deliveries, and it may not be ACID-compliant, or low latency.

Predictable tokens are nice because it becomes clear at ledger write time when the sequence is out of order. But what may be more important is whether the work unit is unique in the ledger. Imagine in Martin's example if a second unexpected delay occurred to token 34. Assuming both work units are equally valid, you may want the soonest delivery, not the latest or most recent. For a distributed lock service, anyway. Conflict resolution is another concern/service entirely--typically factoring in more advanced metrics i.e. vector clocks.

Martin erroneously assumes that the lock service must be responsible for incrementing token ids. In fact, the application could also do this, with a simple map from Redis token uid to integer; or the storage system could AUTO_INCREMENT on INSERT, and the application could map the stored PK to the Redis token on write commit. Those maps can be persisted to any shared storage system.

The magic in Martin's 2nd diagram is a simple unique constraint provided by the storage system. It solves the problem independent of the locking mechanism's reliability, and does so redundantly, assuming some waste work is okay.

Illustratively, if the work unit were "ask human to make coffee", Martin's solution would be discarding every good batch of coffee that is delivered late. In most software applications this is acceptable, but the locking mechanism to begin work is completely optional--it merely increases the probability of even distribution, without guaranteeing it. You could just as precisely run a coffee making lottery, only rewarding the most fast/efficient human delivery per cycle.

As the Redis "Distributed locks" documentation page currently states, the simple solution to achieve both minimal waste and high precision is to extend the TTL beyond all possible timing shenanigans. For example, 4 hours. Well managed systems should never experience network latency, GC pauses, etc. this long without health checks triggering service restarts. Whether your work unit is writing bank checks or dealing prescription medication--in systems where precision matters, we can trade speed for accuracy. If you want both high precision and high speeds, you'll need to get off of commodity hardware.

Note that high TTL does not have to mean low volume, if the workload is highly distributed and parallelized, and your error margin is low, and you release your locks once work transactions are committed. Identify your application cluster instances uniquely as well, and automatically release all locks held by them on startup (e.g. to free locks sooner on process restart).

By these criteria (waste, precision, speed), Martin's timing shenanigans eventually invalidate ALL distributed lock systems that are physically separated from their work ledger systems.

Sorry opportunistic fanboys, ZooKeeper, Kafka, etc. will not save you.

Mike S. • 3 years ago

...from outdated clocks on locks between commodity boxes.

eternal_solver • 3 years ago

Great post! It nicely explains the fundamentals of distributed locking from a generic standpoint, beyond the Redlock-specific details. I stumbled upon this while researching Curator's distributed locking recipe.
Have you had any experience with that?

Martin Kleppmann • 3 years ago

I have not used Curator, but a quick look at the docs https://curator.apache.org/... shows that the acquire() method does not return any value that could be used for fencing. That suggests to me that the Curator lock is not safe.

More generally with ZooKeeper you can use the zxid or the version number on the ZNode for fencing. Seems a bit problematic of Curator to not expose that value.
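
For example, a rough sketch with the kazoo client (paths are illustrative; this is not Curator's API) could be:

```python
# Illustrative sketch: using a znode's version number as a fencing token (kazoo).
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()
zk.ensure_path("/locks/my-resource")

# Update the lock znode on acquisition; the returned stat carries a version
# number that increases monotonically with every update of this znode.
stat = zk.set("/locks/my-resource", b"held")
fencing_token = stat.version
# ... pass fencing_token along with every write to the protected resource ...
```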

eternal_solver • 3 years ago

"More generally with ZooKeeper you can use the zxid or the version number on the ZNode for fencing"
This is indeed similar to what I'm thinking about doing. In fact, I'm thinking about having each client create a child znode under a well-known znode, say /root/app_name, with the name lock_hostName, where hostName is the name of the physical node hosting the client. At any point in time, there could be just one child underneath /root/app_name. Then, every time a client successfully acquires a lock, it'd check for the existence of children of /root/app_name. If there are none, the lock is legitimate (in the sense that it was NOT obtained by the current client due to a split-brain kind of situation caused by the loss of connectivity of the previous lock-owning client with ZK). Under such conditions, the current client can proceed with the work (that can be done by only ONE process at a time), and finally removes the child znode before releasing the lock. If, on the other hand, the client discovers a child under /root/app_name, it should throw up its arms and generate a FATAL alert, which should probably be resolved with manual intervention.
Sorry for the long post. But, may be you can share some valuable feedback/thoughts on this approach.

서지웅 • 3 years ago

Great post! May I translate this article into Korean?

hackerwins • 3 years ago

Hi 지웅. I'd like to see your Korean version of this article. Did you translate this?

Martin Kleppmann • 3 years ago

Sure, all articles on this site are licensed under creative commons: https://creativecommons.org... — you're welcome to translate it, and please include a link back to the original here.

Huaqing Li • 3 years ago

Hi Martin Kleppmann, there is one point of confusion I'd like to hear your thoughts about. In the Redlock solution, a client is considered to have successfully acquired the lock if it gets locks from the majority of the Redis masters (that is, at least N/2+1 out of N Redis instances). I'm looking for a clearer definition of 'majority'. Consider the scenario where there are 5 Redis instances and 3 of them are down: it is then impossible for any client to get more than half of the locks. So should we always check how many Redis instances are running before we acquire a lock? But if so, every client may get a different result at a different time.

Martin Kleppmann • 3 years ago

The definition of "majority" assumes that you know how many redis instances you have (N), which would typically be part of your fixed configuration of the system. Given that you know N, a majority requires that you get votes from more than half of the instances. If 3 out of 5 instances are down, that does not mean N=2; it is still the case that N=5. In this case you will not be able to get the three required votes, and so you will not be able to acquire the lock. You don't need to explicitly check whether an instance is down, because a crashed instance is handled exactly the same way as an instance you cannot reach due to a network problem, or an instance that fails to vote for any other reason.

Huaqing Li • 3 years ago

That makes sense to me. I probably missed something about the solution to this; is it covered in your article? Thanks!

John Mullaney • 4 years ago

Why should the fencing token need to be monotonically increasing?

Wouldn't it just need to be different on each operation? E.g., couldn't a UUID be used? The key thing is checking whether the write token value is equal to the current token value. Less-thans and greater-thans wouldn't come into it.

Martin Kleppmann • 4 years ago

The storage system that checks the fencing tokens doesn't know when a new client is granted the lock, it can only go by the token. If you used a UUID and a request turned up with a UUID that has never been seen before, how would the storage system know what to do? Perhaps the lock has been granted to a new client, so the new UUID should be accepted and the old UUID should be blocked. But perhaps the UUID is from a client that acquired the lock and then immediately paused before it was able to make any requests; in that case, a third client will have already acquired the lock, and the previously-unseen UUID should be blocked. Using monotonically increasing fencing tokens solves this conundrum.

John Mullaney • 4 years ago

Thanks, I see. Hm... though now I wonder if locks/leases aren't needlessly complex for correctness. If I'm understanding this, the locking scheme assumes the storage system (a) provides a reliable check of the token value coming in with the write request and (b) does the check as part of an atomic check-write operation.

With that, a storage system could ensure correctness without a lock service, like this: it maintains a current token value for a resource. A client reads this token value along with the resource. With the write operation the client sends along a two-part token: the first part is the original token value and the second part is a new unique value. It must be unique, so it could be a UUID. The storage system, in its atomic check-write operation, checks that the first part of the token matches the resource's current token value. If it matches, the second part of the incoming token is written as the new current token value of the resource (along with the rest of the write operation). If it doesn't match, the write fails.
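
(A toy sketch of what I mean, with an in-memory stand-in for the storage system; apply_write is a hypothetical helper and all names are made up:)

```python
# Toy sketch of the two-part-token check-write described above (illustrative only).
import uuid

current_token = "initial"

def check_and_write(expected_token, new_token, data):
    global current_token
    # Must execute atomically on the storage side.
    if expected_token != current_token:
        return False          # the resource changed since the client's read
    current_token = new_token
    apply_write(data)         # hypothetical helper performing the actual write
    return True

# Client side: read the token together with the resource, then write with a fresh UUID.
observed = current_token
ok = check_and_write(observed, str(uuid.uuid4()), {"field": "new value"})
```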

I think that's correctness where the only assumptions are the uniqueness of the tokens provided by the clients with the write operations and the atomicity of the storage service's check-write operation -- no locking/leasing assumed.

???? (It would be very awkward to put question marks at the end of all the sentences, but please take all this as a question!)

The use of a lock/lease system, then, would be to answer the question: is something already intending to modify the resource? (Which is a question of efficiency, of course.)