A short introduction to optimal transport and Wasserstein distance (2020)
jethkl · August 21, 2025
Wasserstein distance (Earth Mover’s Distance) measures how far apart two distributions are — the ‘work’ needed to reshape one pile of dirt into another. The concept extends to multiple distributions via a linear program, which under mild conditions can be solved with a linear-time greedy algorithm [1]. It’s an active research area with applications in clustering, computing Wasserstein barycenters (averaging distributions), and large-scale machine learning.
[1] https://en.wikipedia.org/wiki/Earth_mover's_distance#More_th...
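To make the 'pile of dirt' picture concrete, here is a minimal sketch (an illustrative example, not from the article) of the 1-D case in Python: SciPy computes the Wasserstein-1 / Earth Mover's Distance directly, and the same number falls out of the greedy sorted-matching idea mentioned above.

    # 1-D Wasserstein-1 (Earth Mover's) distance between two samples.
    # Illustrative sketch; assumes NumPy and SciPy are installed.
    import numpy as np
    from scipy.stats import wasserstein_distance

    rng = np.random.default_rng(0)
    a = rng.normal(loc=0.0, scale=1.0, size=1000)  # pile of dirt #1
    b = rng.normal(loc=2.0, scale=1.0, size=1000)  # pile of dirt #2

    # For two unit Gaussians shifted by 2, W_1 is roughly 2.
    print(wasserstein_distance(a, b))

    # Same value by hand: in 1-D the optimal transport plan simply matches
    # sorted samples, so the cost is the mean gap between order statistics.
    print(np.abs(np.sort(a) - np.sort(b)).mean())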
ForceBru
Is the Wasserstein distance useful for parameter estimation instead of maximum likelihood? BTW, maximum likelihood is essentially minimum-KL-divergence estimation. All I see online and in papers is how to _compute_ the Wasserstein distance, which seems to be pretty hard in itself. In 1D, it requires computing a nasty integral of inverse CDFs when p != 1. Does that mean "minimum Wasserstein estimation" is prohibitively expensive?
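For reference, the 1-D closed form the question alludes to (standard background, not stated in the thread) is the quantile-function formula below; for p = 1 it collapses to a plain integral of CDF differences, which is why that case is cheap:

    % Wasserstein-p distance in 1-D via inverse CDFs (quantile functions)
    W_p(\mu, \nu) = \left( \int_0^1 \bigl| F_\mu^{-1}(t) - F_\nu^{-1}(t) \bigr|^p \, dt \right)^{1/p}
    % For p = 1 this is equivalent to integrating the CDF gap directly:
    W_1(\mu, \nu) = \int_{\mathbb{R}} \bigl| F_\mu(x) - F_\nu(x) \bigr| \, dx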
317070
It is.
But!
Wasserstein distances are used instead of a KL inside all kinds of VAEs and diffusion models, because while the Wasserstein distance is hard to compute, it is easy to construct stochastic estimates whose expectation is the gradient of the Wasserstein distance. So you can easily get unbiased gradients, and that is all you need to train big neural networks. [0] Pretty much any time you sample from your current distribution and the target distribution and take the gradient of the distance between the points, you are minimizing a Wasserstein distance.
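To illustrate that recipe (a sketch with assumed hyperparameters, not from the comment): fit the mean and scale of a Gaussian to a target by descending a Monte Carlo estimate of the 1-D W_1 distance; sorting gives the optimal pairing, and gradients flow through the reparameterized samples.

    # Minimizing a sampled 1-D Wasserstein-1 distance with PyTorch.
    # Illustrative sketch: exact pairing by sorting only works in 1-D.
    import torch

    torch.manual_seed(0)
    mu = torch.tensor(0.0, requires_grad=True)         # learnable location
    log_sigma = torch.tensor(0.0, requires_grad=True)  # learnable log-scale
    opt = torch.optim.Adam([mu, log_sigma], lr=0.05)
    target = torch.distributions.Normal(3.0, 0.5)      # distribution to match

    for step in range(500):
        eps = torch.randn(256)             # reparameterization trick
        x = mu + log_sigma.exp() * eps     # differentiable model samples
        y = target.sample((256,))          # target samples (no gradient)
        # Sorted matching is the optimal 1-D coupling; its mean gap
        # estimates W_1, and backprop goes through the model samples.
        loss = (x.sort().values - y.sort().values).abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    print(mu.item(), log_sigma.exp().item())  # should end up near 3.0, 0.5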
This is very helpful for understanding generative AI. See, for example, the excellent lectures by Stefano Ermon for Stanford's CS236: Deep Generative Models [1]. All the lectures are available on YouTube [2].
[1] https://deepgenerativemodels.github.io/
[2] https://youtube.com/playlist?list=PLoROMvodv4rPOWA-omMM6STXa...