jethkl 2 days ago

Wasserstein distance (Earth Mover’s Distance) measures how far apart two distributions are — the ‘work’ needed to reshape one pile of dirt into another. The concept extends to multiple distributions via a linear program, which under mild conditions can be solved with a linear-time greedy algorithm [1]. It’s an active research area with applications in clustering, computing Wasserstein barycenters (averaging distributions), and large-scale machine learning.

[1] https://en.wikipedia.org/wiki/Earth_mover's_distance#More_th...
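
If it helps to make that concrete: between two discrete distributions, EMD is just a small transportation LP. A rough sketch with scipy (toy bin positions and weights made up for illustration):

    import numpy as np
    from scipy.optimize import linprog

    # Two discrete distributions: support points and weights (toy data).
    x = np.array([0.0, 1.0, 2.0])   # support of P
    y = np.array([0.5, 2.5])        # support of Q
    p = np.array([0.4, 0.4, 0.2])   # weights of P (sum to 1)
    q = np.array([0.7, 0.3])        # weights of Q (sum to 1)

    # Ground cost |x_i - y_j| (Wasserstein-1 "work" per unit of mass).
    C = np.abs(x[:, None] - y[None, :])
    n, m = C.shape

    # Variables: flow f_ij >= 0, flattened row-major; minimize total cost.
    c = C.ravel()
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0   # mass leaving pile i equals p_i
    for j in range(m):
        A_eq[n + j, j::m] = 1.0            # mass arriving at pile j equals q_j
    b_eq = np.concatenate([p, q])

    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    print("EMD:", res.fun)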

ForceBru 2 days ago

Is the Wasserstein distance useful for parameter estimation instead of maximum likelihood? (BTW, maximum likelihood is essentially minimum-KL-divergence estimation.) All I see online and in papers is how to _compute_ the Wasserstein distance, which seems to be pretty hard in itself. In 1D it requires a nasty integral of inverse CDFs whenever p != 1. Does that mean "minimum Wasserstein estimation" is prohibitively expensive?
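
To be concrete about the 1D case I mean (rough sketch: scipy handles p = 1 directly, and for equal sample sizes the inverse-CDF integral collapses to order statistics):

    import numpy as np
    from scipy.stats import wasserstein_distance

    rng = np.random.default_rng(0)
    x = rng.normal(0.0, 1.0, size=1000)   # samples from one distribution
    y = rng.normal(0.5, 1.0, size=1000)   # samples from the other

    # p = 1: scipy integrates |F(t) - G(t)| dt for the empirical CDFs.
    print(wasserstein_distance(x, y))

    # General p, equal sample sizes: the inverse-CDF integral becomes a
    # mean over paired order statistics.
    def wasserstein_p(x, y, p=2.0):
        return np.mean(np.abs(np.sort(x) - np.sort(y)) ** p) ** (1.0 / p)

    print(wasserstein_p(x, y, p=2.0))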

  • 317070 2 days ago

    It is.

    But!

    Wasserstein distances are used instead of a KL divergence inside all kinds of VAEs and diffusion models because, while the Wasserstein distance itself is hard to compute, it is easy to construct stochastic estimators whose expectation is the gradient of the Wasserstein distance. So you can get unbiased gradients cheaply, and that is all you need to train big neural networks. [0] Pretty much any time you sample from your current distribution and the target distribution and take the gradient of the distance between the points, you are minimizing a Wasserstein distance (toy sketch below).

    [0] https://arxiv.org/abs/1711.01558
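
    A 1D caricature of that last point (toy model, target, and step size made up for illustration): fitting a location parameter by taking gradients of the distance between matched samples is exactly stochastic gradient descent on the empirical Wasserstein-1 distance.

        import numpy as np

        rng = np.random.default_rng(0)

        # Toy "generator": x = theta + z, z ~ N(0, 1). Target: N(3, 1).
        theta, lr, batch = 0.0, 0.05, 256

        for step in range(500):
            z = rng.normal(size=batch)
            x = theta + z                           # samples from current model
            y = rng.normal(3.0, 1.0, size=batch)    # samples from target

            # In 1D the optimal coupling matches sorted samples, so the batch
            # distance is mean |x_(i) - y_(i)| and its (sub)gradient wrt theta
            # is mean sign(x_(i) - y_(i)).
            xs, ys = np.sort(x), np.sort(y)
            theta -= lr * np.mean(np.sign(xs - ys))

        print(theta)   # ends up near 3.0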

  • JustFinishedBSG a day ago

    Wasserstein itself is expensive to compute, but you can instead optimize entropic regularizations of it that come arbitrarily close (the Sinkhorn algorithm) and are both easy to optimize and differentiable (rough sketch below).
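
    Roughly, the Sinkhorn iterations just alternate row and column rescalings of K = exp(-C/eps) until the marginals match. A minimal sketch (toy inputs, no log-domain stabilization, which you'd want for small eps):

        import numpy as np

        def sinkhorn(p, q, C, eps=0.1, iters=200):
            # Entropic OT: scale K = exp(-C/eps) so its row sums are p
            # and its column sums are q.
            K = np.exp(-C / eps)
            u, v = np.ones_like(p), np.ones_like(q)
            for _ in range(iters):
                u = p / (K @ v)
                v = q / (K.T @ u)
            plan = u[:, None] * K * v[None, :]   # approximate transport plan
            return np.sum(plan * C)              # regularized transport cost

        # Toy discrete distributions on a grid.
        x = np.linspace(0.0, 1.0, 5)
        y = np.linspace(0.0, 1.0, 4)
        C = np.abs(x[:, None] - y[None, :])
        print(sinkhorn(np.full(5, 0.2), np.full(4, 0.25), C))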