Notes on Optimal Transport for Multilingual Embeddings
Posted on Thu 01 Oct 2020 in Multilinguality, Embeddings
Monge's Problem
Monge's problem is a mapping problem: given two weighted sets of points \(X= \{x_i, i=1 \dots n\}\) with respective masses \(a_i\) and \(Y=\{y_j, j=1 \dots m\}\) with respective masses \(b_j\), the Monge assignment problem seeks a map \(\Gamma\) between \(X\) and \(Y\) such that (a) \(\Gamma\) minimizes some mapping cost (in Monge's initial problem, a distance between points), and (b) \(\Gamma\) satisfies mass conservation: the mass arriving at each \(y_j\) is exactly the mass of the points mapped onto it, \(\forall j, \sum_{i \mid \Gamma[x_i] = y_j} a_i = b_j\).
This problem can also be formulated in continuous spaces, with the point masses \(a\) and \(b\) replaced by measures \(\mu_X\) and \(\mu_Y\). \(\Gamma\) is then constrained to minimize the total transportation cost \(\int_X \mu_X(x) c(x, \Gamma(x)) dx\) under the constraint that for all measurable sets \(B\), \(\int_B \mu_Y(y)dy = \int_{\Gamma^{-1}(B)} \mu_X(x) dx\). [Any \(\Gamma\) realizing this is a push-forward mapping, denoted \(\Gamma_{\#} \mu_X = \mu_Y\).]
In discrete spaces, a related problem is the classical combinatorial assignment problem, which corresponds to \(m=n\) and all masses equal to one. In this case, the optimal mapping can be computed with the Hungarian algorithm in \(O(n^3)\). Such methods have been used to perform heuristic word alignments in parallel sentences, assuming some matching cost (e.g. pointwise mutual information or any reasonable cooccurrence measure) between points. For multilingual embeddings, \(X\) will be the set of embeddings in one language and \(Y\) the set of embeddings in another language, and we will need to map one onto the other. A nice property is that the two sets will no longer need to have the same size.
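For the classical assignment case, here is a minimal sketch on toy data; `scipy.optimize.linear_sum_assignment` solves exactly this linear sum assignment problem, and the embedding matrices `X` and `Y` below are hypothetical placeholders:

    :::python
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    # Toy data: n points in each set, all with unit mass (classical assignment).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 3))   # "source" embeddings (hypothetical)
    Y = rng.normal(size=(5, 3))   # "target" embeddings (hypothetical)

    # Cost matrix: squared Euclidean distance between all pairs of points.
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)

    # Optimal one-to-one mapping and its total cost.
    rows, cols = linear_sum_assignment(C)
    print(list(zip(rows, cols)), C[rows, cols].sum())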
The Kantorovitch relaxation drops the assumption that \(\Gamma\) is a mapping and considers probabilistic couplings instead - which means that the mass at each \(x\) can be split and distributed over several \(y\). This yields a probabilistic assignment problem, again subject to marginal constraints.
The transport problem
Basics
In discrete spaces, we define the set of transportation plans between marginals \(p \in \mathbb{R}_+^n\) and \(q \in \mathbb{R}_+^m\) as \(U(p,q) = \{\Gamma \in \mathbb{R}_+^{n \times m} \mid \Gamma \mathbf{1}_m = p,\ \Gamma^T \mathbf{1}_n = q\}\).
\(U(p,q)\), the set of matrices satisfying these constraints, is a bounded set defined by \(m+n\) linear equality constraints (plus positivity): it is a convex polytope. Another way to look at \(U(p,q)\) is that its elements are joint probability distributions with fixed marginals \(p\) and \(q\).
Recall that the Frobenius inner product between matrices \(\Gamma\) and \(M\) is defined as \(<\Gamma,M> = \operatorname{Tr}(\Gamma^T M) = \sum_{i,j} \Gamma_{ij} M_{ij}\).
Let \(M\) be a cost matrix satisfying the following properties: \(M_{ij} = 0\) iff \(i=j\), \(M_{ij} = M_{ji}\) (symmetry), and the triangle inequality \(\forall i,j,k,\ M_{ik} \le M_{ij} + M_{jk}\).
When \(\Gamma\) is in \(U(p,q)\), \(<\Gamma,M>\) can be read as an expectation of the transportation cost under \(\Gamma\): \(<\Gamma,M> = E_{(i,j) \sim \Gamma}[M_{ij}]\).
This formulation also serves to define the problem in continuous spaces.
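As a tiny numpy illustration of these definitions (toy marginals and a hypothetical cost matrix): the independent coupling \(pq^T\) always belongs to \(U(p,q)\), and \(<\Gamma,M>\) is just an elementwise sum.

    :::python
    import numpy as np

    p = np.array([0.5, 0.3, 0.2])                 # row marginal (n = 3)
    q = np.array([0.4, 0.6])                      # column marginal (m = 2)
    M = np.array([[0., 1.], [1., 0.], [2., 1.]])  # hypothetical cost matrix

    Gamma = np.outer(p, q)                        # independent coupling p q^T
    assert np.allclose(Gamma.sum(axis=1), p)      # row sums = p
    assert np.allclose(Gamma.sum(axis=0), q)      # column sums = q

    cost = np.sum(Gamma * M)                      # <Gamma, M> = E_{(i,j)~Gamma}[M_ij]
    print(cost)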
Optimal Transport Problem
(Kantorovitch) Optimal transport problem (OTP) between \(p\) and \(q\) (for fixed \(M\)): find \(\Gamma\) minimizing \(<\Gamma,M>\) subject to the marginal constraints \(\Gamma \in U(p,q)\). The optimal value is denoted \(\operatorname{OTP}^*(p,q)\).
This is a tight relaxation of the Monge problem: when the marginals are uniform distributions over sets of the same size, the OTP admits an optimal solution that is a one-to-one mapping - no other coupling is better (proof in [PC, p15]).
OTP is a linear problem (linear objective with linear constraints) that can be solved with the simplex algorithm or variants. The solution is a vertex of the polytope (cannot be an interior point).
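To make this concrete, here is a minimal sketch solving the OTP as a generic LP with `scipy.optimize.linprog`; the marginals and cost matrix are hypothetical toy values, and \(\Gamma\) is flattened row-major:

    :::python
    import numpy as np
    from scipy.optimize import linprog

    def solve_otp_lp(M, p, q):
        """Exact OT by linear programming; Gamma is flattened row-major."""
        n, m = M.shape
        # Row-sum constraints: sum_j Gamma[i, j] = p[i]
        A_rows = np.kron(np.eye(n), np.ones((1, m)))
        # Column-sum constraints: sum_i Gamma[i, j] = q[j]
        A_cols = np.kron(np.ones((1, n)), np.eye(m))
        A_eq = np.vstack([A_rows, A_cols])
        b_eq = np.concatenate([p, q])
        res = linprog(M.ravel(), A_eq=A_eq, b_eq=b_eq,
                      bounds=(0, None), method="highs")
        return res.x.reshape(n, m), res.fun

    p = np.array([0.5, 0.5])
    q = np.array([0.3, 0.3, 0.4])
    M = np.array([[0., 1., 2.], [1., 0., 1.]])
    Gamma, cost = solve_otp_lp(M, p, q)
    print(Gamma, cost)

As expected, the returned \(\Gamma\) sits on a vertex of the polytope: it has at most \(n+m-1\) nonzero entries.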
Distance induced by the OTP
The optimal value of the OTP defines a distance between the marginal distributions, provided \(M\) is a positive power of a distance matrix.
More precisely, if \(M=D^\alpha\) with \(\alpha \ge 1\), where \(D\) is a distance matrix, then \(W_{D,\alpha}(p,q) = \operatorname{OTP}^*(p,q)^{1/\alpha}\) is a distance between \(p\) and \(q\).
Symmetry is obvious. \(\operatorname{Diag}(p)\) transports \(p\) to itself at cost 0 (assuming \(M_{ii}=0\)); likewise, if \(M(u,v) >0\) for \(u \neq v\), then the transport cost is strictly positive when \(p \neq q\). The triangle inequality must hold because we can move \(p\) to \(p'\) with \(\Gamma\) and \(p'\) to \(p''\) with \(\Gamma'\), so the cheapest way to transport \(p\) to \(p''\) cannot cost more than this composition. The proof in the discrete case is in [PC, p20]; for the continuous case we need the gluing lemma.
For \(0<\alpha<1\), this is no longer true; one has to look at \(W_{D,\alpha}(p,q)^{\alpha} = \operatorname{OTP}^*(p,q)\) to get a distance.
Duality. The dual problem is to find the maximum of \(E_{\mu_X}(\phi) + E_{\mu_Y}(\psi)\) such that \(\phi(x) + \psi(y) \le M(x,y)\). This can be seen by writing the primal problem in matrix form (\(\min_x c^T x \text{ s.t. } Ax = b\)) and identifying \(b\) with the concatenation of \(p\) and \(q\), which yields the dual form \(\max_y b^T y \text{ s.t. } A^T y \le c\). This shows that the dual objective has the form of a sum of two expectations, one over \(p\) and one over \(q\).
The proof in the discrete case is in [PC, p24] and relies on the theory of Lagrange multipliers (in the discrete case, \(\phi\) and \(\psi\) are the Lagrange multipliers).
There may be a telling analogy: think of \(\phi\) as a price for just moving the mass at \(X\) and of \(\psi\) as a price for just moving the mass at \(Y\), with the constraint that the sum of the two prices never exceeds the transportation cost; the maximal total price is then attained at a value that is also reached by an optimal transportation plan. This is discussed at length in [PC, p24-25].
Wasserstein distances
When \(\mathcal{X}\) is a metric space and \(M(x,y) = D^p(x,y)\) for some \(p \ge 1\), the induced distance (or rather its \(p\)-th root) is the \(p\)-Wasserstein distance. This is an instance of the more general Earth Mover's Distances, which are defined over arbitrary distances.
When \(p=1\), the [Kantorovitch-Rubinstein duality theorem] states that: \(W_1(p,q) = \sup_{\phi \in F_L} E_{x \sim p}[\phi(x)] - E_{y \sim q}[\phi(y)]\)
where \(F_L\) is the class of all 1-Lipschitz functions on \(\mathcal{X}\).[#lipschitz] In other words, the distance between distributions is defined as the maximal difference of expectations over all such variation-bounded functions.
A detailed proof shows that these results derive from an analysis of the dual form, where the constraints immediately yield that the optimum of the dual is such that \(\forall i,\ \phi(x_i) + \psi(x_i) \le 0\). In fact, it can even be shown that, given the positivity of \(p\) and \(q\) and once the boundedness constraints are added, the optimum satisfies \(\phi(x_i) + \psi(x_i) = 0\).
These definitions generalize to the continuous case, where Wasserstein distances are very important because they define weak distances between measures, which can be used to redefine fundamental notions such as convergence.
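For the one-dimensional case with \(p=1\), scipy provides a direct implementation; a small sketch with hypothetical weighted samples (the point masses play the role of \(a_i\) and \(b_j\) above):

    :::python
    import numpy as np
    from scipy.stats import wasserstein_distance

    # Two weighted point clouds on the real line (hypothetical values).
    x = np.array([0.0, 1.0, 3.0])
    y = np.array([0.5, 2.0])
    wx = np.array([0.2, 0.5, 0.3])   # masses a_i (the two sets need not match in size)
    wy = np.array([0.6, 0.4])        # masses b_j

    # 1-Wasserstein (Earth Mover's) distance between the two distributions.
    print(wasserstein_distance(x, y, u_weights=wx, v_weights=wy))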
Entropic relaxation and fast OTP computation
Sinkhorn distances: \(\forall p,q,\ \forall \Gamma \in U(p,q),\ H(\Gamma) \le H(p) + H(q) = H(pq^T)\), and the inequality is tight (it is an equality for the independent coupling \(pq^T\)).
\(U_\alpha = \{P \in U(p,q) \text{ such that } \operatorname{KL}(P\|pq^T) \le \alpha \}\) (with \(\alpha > 0\), otherwise this is meaningless) is the set of transport plans whose entropy is within \(\alpha\) of that of the independent plan \(pq^T\).
Sinkhorn distance: \(d_{M,\alpha} = min_{ \Gamma \in U_\alpha} <\Gamma,M>\)
So this is the original transportation problem with an additional entropic constraint on \(\Gamma\), which keeps \(\Gamma\) smooth (the independent plan \(pq^T\) has no zero entries).
New problem (for \(\lambda >0\)), which yields the dual Sinkhorn divergence: \(d_M^\lambda(p,q) = <\Gamma^\lambda, M>\), where \(\Gamma^\lambda = \operatorname{argmin}_{\Gamma \in U(p,q)} <\Gamma,M> - \frac{1}{\lambda} h(\Gamma)\).
For each value of \(\alpha\), there exists \(\lambda\) such that \(d_M^\lambda(p,q) = d_{M,\alpha}(p,q)\) (duality theory). It can be found by varying \(\lambda\) monotonically -- the entropy \(h(\Gamma)\) is between \(0\) and \(\log(d)\), so when \(\lambda\) goes to \(0\), the weight of the second term goes to infinity, and \(h(\Gamma)\) increases. [I have a problem here]
Computing dual Sinkhorn divergences relies on an old theorem which says that for each positive matrix \(A\) there is a unique pair of positive diagonal matrices (up to scaling) such that \(\operatorname{Diag}(u)\, A\, \operatorname{Diag}(v)\) is doubly stochastic (or, more generally, has prescribed row and column marginals \(p\) and \(q\)); it can be computed with the Sinkhorn fixed-point algorithm. Starting from uniform \(u\) and \(v\), iterate \(u \leftarrow p / (K v)\) and \(v \leftarrow q / (K^T u)\), with \(K = \exp(-\lambda M)\) (elementwise); the regularized plan is then \(\Gamma^\lambda = \operatorname{Diag}(u)\, K\, \operatorname{Diag}(v)\).
Assuming proper stopping criteria, solving this for large values of \(\lambda\) gets you closer to the actual solution of the initial problem (because the entropy term becomes less and less important).
Example implementation:

    :::python
    import numpy as np

    def compute_optimal_transport(M, r, c, lam, epsilon=1e-8):
        """Compute the optimal transport matrix and Sinkhorn distance
        using the Sinkhorn-Knopp algorithm.

        Inputs:
          - M : cost matrix (n x m)
          - r : vector of marginals (n, )
          - c : vector of marginals (m, )
          - lam : strength of the entropic regularization
          - epsilon : convergence parameter

        Outputs:
          - P : optimal transport matrix (n x m)
          - dist : Sinkhorn distance
        """
        n, m = M.shape
        P = np.exp(-lam * M)
        P /= P.sum()
        u = np.zeros(n)
        # normalize this matrix: alternately rescale rows and columns
        # until the row sums stop changing
        while np.max(np.abs(u - P.sum(1))) > epsilon:
            u = P.sum(1)
            # reshape will handle the transposition operations
            P *= (r / u).reshape((-1, 1))
            P *= (c / P.sum(0)).reshape((1, -1))
        return P, np.sum(P * M)
A computational analysis is in http://papers.nips.cc/paper/6792-near-linear-time-approximation-algorithms-for-optimal-transport-via-sinkhorn-iteration
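For cross-checking, the same quantities can be computed with an off-the-shelf implementation. A minimal sketch assuming the POT library (`pip install pot`); if I read its convention correctly, its `reg` parameter plays the role of \(1/\lambda\) above:

    :::python
    import numpy as np
    import ot  # POT: Python Optimal Transport

    p = np.array([0.5, 0.5])
    q = np.array([0.3, 0.3, 0.4])
    M = np.array([[0., 1., 2.], [1., 0., 1.]])

    G_exact = ot.emd(p, q, M)                # exact LP solution
    G_sink = ot.sinkhorn(p, q, M, reg=0.05)  # entropic regularization (reg ~ 1/lambda)
    print(np.sum(G_exact * M), np.sum(G_sink * M))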
Wasserstein distance as a cost function - Wasserstein GANs
The model
Let \(p\) be the empirical distribution and \(q\) the model distribution. For fixed \(M\) (e.g. the usual metric in a metric space), a natural objective is to make \(q\) as close to \(p\) as possible by minimizing the Wasserstein distance. This is what WGANs do for a class of generative models where we have observed data points distributed under \(p\), and generated data points that result from a two-step procedure: first sample \(z \sim q\) (e.g. white noise), then generate \(x\) deterministically through \(x=g_\theta(z)\) (e.g. a neural network). We would like to find the optimal \(\theta\) minimizing the distance between the distributions \(p\) and \(g_{\theta\#}q\), i.e. \(\min_\theta W_1(p, g_{\theta\#}q)\),
or equivalently (by Kantorovitch-Rubinstein duality): \(\min_\theta \sup_{\phi \in F_L} E_{x \sim p}[\phi(x)] - E_{y \sim g_{\theta\#}q}[\phi(y)]\)
where \(F_L\) is the class of all 1-Lipschitz functions from \((\mathcal{X},d)\) into \(\mathbb{R}\).
As \(x=g_\theta(z)\), this program boils down to: \(\min_\theta \sup_{\phi \in F_L} E_{x \sim p}[\phi(x)] - E_{z \sim q}[\phi(g_\theta(z))]\)
where \(F_L\) is the class of all 1-Lipschitz functions on \((\mathcal{X},d)\).
Assuming we have a set of 1-Lipschitz (or \(K\)-Lipschitz) functions \(\phi_\lambda\) parameterized by \(\lambda\), we could approximate \(W(p,p_\theta)\) as \(\max_\lambda E_{x \sim p}[\phi_\lambda(x)] - E_{z \sim q}[\phi_\lambda(g_\theta(z))]\).
Barring some technical conditions, Wasserstein GANs will:

1. find an appropriate \(\lambda\) to compute the Wasserstein distance;
2. given this \(\lambda\), minimize the distance in \(\theta\); the important mathematical point here is that we can differentiate the Wasserstein distance wrt. \(\theta\) as \(-E_{z \sim q}[\nabla_\theta \phi_\lambda(g_{\theta}(z))]\);
3. make sure \(\phi_\lambda\) remains Lipschitz; this is achieved in the original paper by clipping the weights into a fixed \(d\)-dimensional box after each update.
Step 1 will roughly correspond to optimizing the GAN's critic / discriminator - find a \(\phi_\lambda\) whose expectation under \(p\) is as different as possible from its expectation under \(p_{\theta}\); step 2 then updates \(\theta\) to reduce this difference.
Technical details are in the original publication by (Arjovsky et al, 2017); easier reads are this post or this one. The weight clipping part seems to be an issue, and alternatives exist, for instance the gradient penalty of https://arxiv.org/abs/1704.00028.
Implementing WGANs
The basic implementation would do the following:

    initialize theta, lambda
    for i in range(I):
        # inner loop: estimate the Wasserstein distance (critic update)
        while not converged:
            sample x_1 ... x_n (data points)
            sample z_1 ... z_m (generated samples)
            compute F = 1/n sum_i phi_lambda(x_i) - 1/m sum_j phi_lambda(g_theta(z_j))
            compute grad_lambda F
            update lambda
            clip(lambda)
        # outer step: update the generator
        compute grad_theta = - 1/m sum_j grad_theta phi_lambda(g_theta(z_j))
        update theta
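A compact PyTorch sketch of this loop; the toy 1-D data, network sizes, learning rate, number of critic steps and clipping value are all hypothetical choices, with weight clipping as in the original recipe:

    :::python
    import torch
    import torch.nn as nn

    dim_z, clip_value, n_critic = 8, 0.01, 5

    # Generator g_theta and critic phi_lambda (hypothetical small MLPs).
    g = nn.Sequential(nn.Linear(dim_z, 32), nn.ReLU(), nn.Linear(32, 1))
    phi = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))
    opt_g = torch.optim.RMSprop(g.parameters(), lr=5e-5)
    opt_phi = torch.optim.RMSprop(phi.parameters(), lr=5e-5)

    def sample_data(n):
        # "real" points x ~ p (toy 1-D Gaussian)
        return torch.randn(n, 1) * 0.5 + 2.0

    for step in range(1000):
        # 1. Inner loop: update the critic to estimate the Wasserstein distance.
        for _ in range(n_critic):
            x = sample_data(64)
            z = torch.randn(64, dim_z)
            # maximize E_p[phi(x)] - E_q[phi(g(z))]  (so minimize its negation)
            loss_phi = -(phi(x).mean() - phi(g(z)).mean())
            opt_phi.zero_grad()
            loss_phi.backward()
            opt_phi.step()
            # 3. Keep phi (approximately) Lipschitz by clipping its weights.
            with torch.no_grad():
                for w in phi.parameters():
                    w.clamp_(-clip_value, clip_value)
        # 2. Update the generator to reduce the estimated distance.
        z = torch.randn(64, dim_z)
        loss_g = -phi(g(z)).mean()
        opt_g.zero_grad()
        loss_g.backward()
        opt_g.step()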
Note: in the sampling step, does it matter to sample the same number of points? In the second estimation, can we reuse the same data points? Does it matter? We could use more of one than of the other, which would yield better estimates of the expectation.
Application to multilingual embeddings
The Wasserstein-GAN model is used to train multilingual embeddings in an unsupervised fashion in (Meng Zhang et al, Proc EMNLP 2017). In this application, the setting is a bit different from the typical WGAN, as the starting point is a pair of sets of monolingual embeddings \(X_S\) and \(X_T\) in the source and target languages respectively. We still assume that we would like to find a transform from one into the other such that the Wasserstein distance between \(X_S\) and \(g_\theta(X_T)\) is minimized. The authors seem to closely follow the approach of (Arjovsky et al, 2017) and do not need to pair source and target words in the first step (computation of the Wasserstein distance). The second step only uses one language and also requires no pairing.
Follow-up Questions:
1) The basic procedure is asymmetric and does not require supervision, which is very nice but probably not sufficiently put forward. Alternative approaches assume supervision - can we use it? This would mean more constraints on the transportation map.
The authors mention heuristic sparsification of the transportation matrix. I am assuming that changing the regularizer from entropic to \(L^2\) would do exactly this. We would have an extra learnable parameter to control the sparsity level.
The problem of finding the cost matrix given \(c\) is also interesting. This is an inverse problem, studied for instance in [https://jmlr.csail.mit.edu/papers/volume20/18-700/18-700.pdf], and it may be somewhat more interesting than the initial problem, provided we express the distance function as
2) The system does not really use masses, whereas we could fairly easily weight words with their frequencies or log-frequencies and use this in sampling. How should we do this? Use unigram distributions when sampling? That seems to be the minimum. Would the effect be to map frequent words to frequent words?
3) Could we do it with a word model, that is, without computing embeddings for a fixed set? The answer is yes: if we have a way to generate character strings, we could do it much more easily. Think about it.
4) How good would this be for word alignments?
5) How to recover the transportation map? Given \(\theta\), we have two sets of points and a function \(G\) that maps one set to the other. One would then need to solve the primal problem (or a relaxed version) to get the transportation map.
6) Applications: MT evaluation with METEOR-like approaches.
Additional readings on the same issue (precursors):

- Meng Zhang, Yang Liu, Huanbo Luan, Yiqun Liu, and Maosong Sun. Inducing Bilingual Lexica From Non-Parallel Data With Earth Mover's Distance Regularization. In Proceedings of COLING, 2016. [paper] https://zmlarry.github.io/publications/coling2016.pdf
- Meng Zhang, Yang Liu, Huanbo Luan, Maosong Sun, Tatsuya Izuha, and Jie Hao. Building Earth Mover's Distance on Bilingual Word Embeddings for Machine Translation. In Proceedings of AAAI, 2016. [paper] https://zmlarry.github.io/publications/aaai2016.pdf
- Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. Earth Mover's Distance Minimization for Unsupervised Bilingual Lexicon Induction. In Proceedings of EMNLP, 2017. [paper][code] https://zmlarry.github.io/publications/emnlp2017.pdf

And a follow-up:

- Edouard Grave, Armand Joulin, and Quentin Berthet. Unsupervised Alignment of Embeddings with Wasserstein Procrustes. In Proceedings of AISTATS, 2019. http://proceedings.mlr.press/v89/grave19a.html
Side note on the Procrustes part. The Procrustes part is clearer, even if some details still need to be fully understood. What the usual Procrustes problem supervises is a binary matrix, which is replaced in the equation here by a transport matrix. Mathematically, we look for a rotation \(G\) that optimizes a trace objective.
Using the SVD \(V^T T U = A \Sigma B^T\) and the properties of the trace, the optimum of \(\operatorname{Trace}(A \Sigma B^T G) = \operatorname{Trace}(\Sigma B^T G A)\) is attained when \(B^T G A\) is the identity, i.e. \(G = B A^T\). We recover the usual Procrustes problem when \(T_{u,v}\) is the diagonal matrix encoding the supervised word-to-word pairings.
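A small numpy sketch of this closed-form solution in the fully supervised case (paired rows, i.e. \(T\) reduced to a one-to-one pairing); the matrices here are hypothetical toy data:

    :::python
    import numpy as np

    def procrustes_rotation(X, Y):
        """Orthogonal G minimizing ||X G - Y||_F, given paired rows of X and Y."""
        U, _, Vt = np.linalg.svd(X.T @ Y)   # SVD of the cross-covariance
        return U @ Vt                       # closed-form optimum

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))                      # source embeddings (hypothetical)
    G_true, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # a random orthogonal map
    Y = X @ G_true                                     # paired target embeddings
    G = procrustes_rotation(X, Y)
    print(np.allclose(X @ G, Y))                       # True: the map is recovered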
More thoughts on (3): if we have a word-generating model, e.g. \(w \sim \text{unigram(char)}\) and \(g(w) = \text{fastText}(w)\), then I can try to do the same. This has less impact.
Wasserstein auto-encoders
The crux of the argument is to study the case where the model
Sources and references
G. Peyré and M. Cuturi, Computational Optimal Transport (handbook).
Referred to above as PC; very complete, it contains the basic proofs and the necessary material to understand most works.
A nice blog post from Radboud scholars (with code) that helps to explain Wasserstein GANs:
Mathematical foundations
http://www.math.cmu.edu/~mthorpe/OTNotes
Footnotes
[#lipschitz] Note that when \(\phi\) is \(K\)-Lipschitz, \(\frac{\phi}{K}\) is 1-Lipschitz, so the results above generalize somewhat.