We were unable to load Disqus. If you are a moderator please see our troubleshooting guide.

heyohai • 3 years ago

Thanks a lot for your impressive introduction!

I am a little confused by the derivation between "Let’s try changing to sup inf" and "We see that the infimum is concave, as required. Because all functions ....".

Could you explain more about why $inf_gamma E_{x,y~gamma}[|x-y|-(f(x)-f(y)]$ is negative infinity when $f$ is not 1-Lipschitz? Thank you!

Arthur Conmy • 1 year ago

aargh! Likewise I'd love a follow up on this.

EDIT: though I think as mentioned below all that matters is that the expectation is negative so we are able to be interested in solely the |f|_L <= 1 functions.

mc • 5 years ago

Those visualizations are beautiful! congrats on this extremely nice intro!
The great news is that, despite what the bibliography of the paper seems to suggest, there are several ML papers that have been proposed that fit between Villani and the Wasserstein GAN paper.

I am happy to plug mine, notably at nips 2016:
and nips 2013:

but there are also many other interesting projects going on that are worth mentioning, and I am only mentioning pure ML/stats papers.

Spencer Yue • 4 years ago

I think there was a small typo at the Strong Duality section in the second display-mode equation (i.e. document.getElementById('MathJax-Element-115-Frame')).

In particular, the [b, -z* + epsilon] should be transposed:

\begin{bmatrix} \mathbf{A} \\ -\mathbf{c}^T \end{bmatrix}^T
\begin{bmatrix} \mathbf{y} \\ \alpha \end{bmatrix} \leq \mathbf{0}, \ \ \ \
\begin{bmatrix} \mathbf{b} \\ -z^* + \epsilon \end{bmatrix}^T
\begin{bmatrix} \mathbf{y} \\ \alpha \end{bmatrix} > 0

Also, thanks for the great article.

Arthur Conmy • 1 year ago

Thank you very much for the enlightening discussion of this discrete case!

The hardest part to follow was the part

Since P_ r and P_\theta are nonnegative ...
this is not the case.

For one thing, it should be reworded as it reads as if \sum_i f_i + g_i is in fact the objective function! But moreover, I didn't see how the benefits would need to be for the same j value. To flesh this argument out, it would only not be possible to reduce the 'gap' f(x_i)+g(x_i) to be closer to 0 if we had j, k such that

f(x_i) + g(x_j) = |i-j|
f(x_k) + g(x_i) = |i-k|

where j and k might not be equal! Is this what was implied, or is there a way to reduce to the j=k case? I cannot see a way.

Then from this we can deduce a contradiction since adding the two, the LHS is less than |j-k| by the D_{jk} result for g(x_j) and f(x_k), but by the triangle inequality the RHS is at least |j-k|.

Vineeth Bhaskara • 1 year ago

Hi. Really nice tutorial. Do you think this is going to be valid when the range of f is R+ (positive reals) instead of R? Thank you

Jacob Stern • 1 year ago

Hi, looks like there is a minor typo in your second code block. In the linprog call for the dual problem, it should be A.T. This corresponds to your description and agrees with your full notebook on github: https://github.com/vincenth...

The Raven Chaser • 1 year ago

Hello Vincent Herrmann ,

Thanks for the illustrative post. When reading the proof of strong duality, I'm confused about why $\hat A x^* =\hat b_0$ for $\epsilon=0$? In other word, why $Ax^*=b$ for the optimal point $x^*$? The constraints say $Ax≤b$, and [this wikipedia](https://en.wikipedia.org/wi... says only a subset of inequality constraints becomes equalities.
Another question is that when $\epsilon>0$, you does not seem to prove that $y≤0$, which is required by the dual.

Sutharsan Mahendren • 1 year ago

Thank you so much for the explanation in both discrete and continues forms. This gives nice insights. Dual problem can also think as profit maximization method in optimal transfer plan where we need to assign buying and selling price by having constrain which is the profit should be less than the initial transfer cost (c). It will again lead to constraint Lipschitz continuity. I understood how this method is formulated from your explanation.

wissam124 • 2 years ago


Thanks a lot for this amazingly clear introduction to the Kantorovich-Rubinstein duality. Really helped me get a better understanding of what it meant.

I believe there are small typos in the Strong Duality section, 3rd paragraph.
Shouldn't it be $\hat{b_{0}}^{T}\hat{y}$ in the first line? And then I believe it is "Then changing to any number greater than has to result in $\hat{b_{\eps}}^{T}\hat{y}$."


Viktoria Malyasova • 2 years ago


Your minimax principle is wrong. Here's a counterexample:

A = [0, 2], B = [0, 1],
g (a, b) = min (|a + b - 1.5| - 0.5, 0)
sup_b g(a, b) = 0 is a convex function,
inf_a g(a, b) = -0.5 is concave function,
inf_a sup_b g(a,b) = 0 isn't equal to
sub_b inf_a g(a,b) = -0.5

The proof is otherwise very clear and accessible, I wonder if there's a way to fix it!

Vincent Herrmann • 2 years ago


I've just gone through it again, and I still think that it's correct. I have added one sentence to make maybe things a little clearer. But it's quite a while ago that I thought about it in depth, so I'm not absolutely sure.

Your function g(a, b) is 0 for all a and b, isn't it? For it to be -0.5, |a + b - 1.5| would have to be zero, which it can't be. Or am I missing something?

Viktoria Malyasova • 2 years ago

Hi, thanks for answering. No, g isn't constant zero. |a + b - 1.5| is zero for all a,b such that a + b = 1.5. It's a segment in AxB, one end is (0.5, 1), the other end is (1.5, 0).

Function g is zero in lower triangle where a + b <= 1, zero in upper triangle where a + b >= 2, and negative in a strip where 1 <= a + b <= 2.

Vincent Herrmann • 2 years ago

Ah, my bad - sorry! I read it as A and B having only the two elements in the list. Too much programming:)
You're right of course, thanks for pointing it out! I don't see exactly what the problem is at the moment, I'll have to think about it. Maybe a missing regularity condition? Because the result is true, it's established...

Vincent Herrmann • 2 years ago

By the way, I know the error now. Of course g(a, b) has to be convex in a for any fixed b (not just at the supremum) and concave in b for any fixed a. However, it's not as easy to show this for the Wasserstein case. I will update the post as soon as possible.

Daniel Suh • 3 years ago

Thanks for this very nice into! Probably the best Wasserstein explanation I've seen :)
May I ask you what tools you have used other than jupyter (+ matplotlib) to create your diagrams?

Vincent Herrmann • 3 years ago

Thank you, Daniel!
I used Adobe Illustrator to edit the plots (Inkscape is a great free alternative) and GeoGebra for the Farkas diagrams.

Ow • 4 years ago

Hello Vincent Herrmann ! It seems that your Mathjax's CDN has been down so all the formulas are in a mess to me. Could you please find some time to fix it ?

Vincent Herrmann • 4 years ago

Hi! For me alls seems to be working fine... what browser are you using?

Ken Caluya • 4 years ago


I was wondering where you got the functions f and g for the dual problem?

jhoang • 4 years ago

thanks for writing this up. I have a question on the last part. How do you show that the additional term is 0 if gamma in pi and infinity otherwise? The part when you try to make gamma independent of pi (the joint distribution)

Vincent Herrmann • 4 years ago

If gamma in pi, then $E_(x,y~gamma)[f(x) - f(y)]$ and $E_(s~p_r)[f(s)] - E_(r~p_theta)[f(r)]$ are the same, so their difference is always zero, no matter what f we choose. If gamma not in pi, then s and x, or r and y are completely independent and we can choose an f that makes the difference +inf. Does this help?

jhoang • 4 years ago

ah yes! I misread the equation, thanks for the quick response!

Yeon Woo Jeong • 5 years ago


Hi, I implemented Approximation EMD implemented with neural network, and gan and wgan implementation on simple and well-known example.

yjhong • 5 years ago

I really was impressed about your article and it really helped me understand kantrovich duality
I have a question about last part of your proof(wasserstein distance)
You going through your proof by introducing 'sup' term into original equation and
swap 'inf' and 'sup' part by showing minimax principle
What i could not get on is that why convexity is met for 'sup'g(a,b) and 'inf'g(a,b) is concave
Can you explain it more detail if you possible?
Thanks in advance

Vincent Herrmann • 5 years ago

I wasn't careful enough when if blended two different proofs here. The terms convex and concave are wrong in this case. What I actually mean is that every local minimum in $a$ for $sup_b g(a, b)$ is also a global minimum, and every local maximum in $b$ for $inf_a g(a, b)$ is also a global maximum. There is another mistake, if we use expectations of the random variable gamma instead of integration over the ordinary variable gamma, then we won't get to $-inf$ if $||f||_L > 1$, but only some value less than 0. The proof still works the same. Does this help?
I will change this part as soon as I can.Thank you for calling my attention to this issue!

yjhong • 5 years ago

Thanks for your reply.
I think the term 'convex and concave' is right since every local minimum of convex function
is global minimum
Thanks really helps me a lot
But i still can not understand 'inf' ove gamma part of last equation.
Why it is 0 when f is 1-Lipschtz function?

Vincent Herrmann • 5 years ago

I'm no expert in convexity, but since there could be many separate global minima, I don't know if we can really call it convex. But it's definitely something very similar.

The definition of the 1-Lipschitz condition is basically |f(x) - f(y)| <= |x-y|, the difference between two values of f cannot be greater than their distance. Then |x-y| - (f(x) - f(y)) >= 0, for all x and y. We get the inf over gamma for example if we choose a probability of 1 at the same single value for x and y. The expectation is then 0.

yongjun Hong • 5 years ago

Thanks now i understands
So now how i understand is that at first you wanna show 'inf''sup' term is changeable to 'sup''inf term.
That`s why you added term sup over 'f' which is every local minimum is a global minimum. So changing 'inf''sup' term to 'sup'inf term can be done.
Next, inf over gamma term achieves 0 in that only 1-Lipschitz f function gives feasible solution.
Am i understanding right?

Vincent Herrmann • 5 years ago

Yeah, I think that's right.
The idea is to first replace the constraint on gamma (gamma \in pi) with an optimization over a function f, then swap the order of inf_gamma and sup_f, and lastly replace the optimization over gamma with an constraint on f (f is 1-Lipschitz continuous).

So we want to change from inf_gamma sup_f to sup_f inf_gamma. This is only possible if both sup_f is "convex" (or whatever we call it) in gamma, and inf_gamma is "concave" in f. We show that this condition is sufficient in an abstract way in the g(a,b) part. In our case the whole function (all the expectations, but without the inf and sup) is g(gamma, f), and it is relatively easy to see that the conditions are fulfilled.

yongjun Hong • 5 years ago

Thanks for summary.
Thank you

Yeon Woo Jeong • 5 years ago

This is how I worked on. It is really hard to understand in continuous case. Almost I gave up and recommended your blogs in the end haha.

Vincent Herrmann • 5 years ago

I appreciate that you stuck with it! Yeah, the bilevel optimization and convex functions are tricky, but I think it's great how quickly you can prove it this way.

Yeon Woo Jeong • 5 years ago

Uhmm, it is hard to get the proof of strong duality. But I did it. Maybe you have more time, you rather add some explanation on it

Vincent Herrmann • 5 years ago

Thanks for all the corrections! Sorry if the errors confused you...
I agree that some of it is quite difficult to grasp. There is always this temptation to make things look easy by making the explanation short. But of course, most of the the time, this just means that the reader has to do everything himself. I will try to think of something to clear it up a bit.

Yeon Woo Jeong • 5 years ago

At weiestrass distance, not dx, y but dx dy

Yeon Woo Jeong • 5 years ago

I found one more error \sum f_i + g_i instead of \sum_i f_i - g_i

Yeon Woo Jeong • 5 years ago

\alpha z^*= \alpha c^Tx^* \geq y^TAx^*=y^Tb>\alpha(z^*-\epsilon)
You'd better use this equation to show \alpha>0

Yeon Woo Jeong • 5 years ago

actually \widetilde{z} > z* - \epsilon, not equal.

Vincent Herrmann • 5 years ago

What I meant is that $\tilde{z} := z* - \epsilon$ is possible. All we want is this one value $\tilde{z}$ that gets close to $z^{*}$, the weak duality takes care of the rest.
I changed the equals sign to a defined sign, maybe this is clearer.

Yeon Woo Jeong • 5 years ago

I got your purpose. Thanks for your kindness.

Zhaolun Zou • 5 years ago

Hi. May I translate this into Chinese?

Vincent Herrmann • 5 years ago

Yes, of course!
Feel free to tell me if you find some errors along the way:)

Zhaolun Zou • 5 years ago

Thanks again for such a great article!
In the first paragraph of section "Linear Programming", I think b should be a m dimensional vector while it's defined as a n dimensional vector. Is that a typo or I'm getting it wrong?

Vincent Herrmann • 5 years ago

Yes, you're absolutely right - that was a typo, I fixed it now.
Damn, it's hard to spot them all... thanks for being so attentive!

Zhaolun Zou • 5 years ago

Thanks a lot! XD

Alex Vig • 5 years ago

MathJax rendering is not rendering inline math. You might want to change the cdn... see this link: https://www.mathjax.org/cdn...

Vincent Herrmann • 5 years ago

Oh, I did not know that. Thanks for telling me!
I changed it and everything seems to be working, at least on my machine (Safari & Chrome).

Alex Vig • 5 years ago

Great - it's working now thanks for the post!

Yeon Woo Jeong • 5 years ago

Hey may I translate these into Korean?