Reading Group - GDR TAL

Posted on Fri, 01 Jan 2021 in misc

October 30, 2020

A multilingual view of unsupervised MT

Garcia, Foret, Sellam and Parikh

The authors devise a model for multilingual unsupervised MT. The main idea is as follows: consider \(L\) languages and model the joint distribution \(P(x,y,z,\dots)\) (let us assume \(L=3\) for the sake of the argument) based on a collection of monolingual or bilingual corpora. The translation parameters require conditional models, so the main objective is a sum

$$ \log E_{y,z} P_{\theta}(x|y,z) + \log E_{x,z} P_{\theta}(y|x,z) + \log E_{x,y} P_{\theta}(z|x,y) $$

where the unobserved source sentences are handled as latent variables in the model. A major assumption is that we do not need both \(y\) and \(z\) to generate the translation \(x\), hence \(P_{\theta}(x|y,z) = P_{\theta}(x|y) = P_{\theta}(x|z) = \sqrt{P_{\theta}(x|y)P_{\theta}(x|z)}\).
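Taking logs, this agreement assumption reads

$$ \log P_{\theta}(x|y,z) = \frac{1}{2}\log P_{\theta}(x|y) + \frac{1}{2}\log P_{\theta}(x|z), $$

which is where the factors \(\frac{1}{2}\) in the bound below come from.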

Each term in the summation is lower bounded using Jensen's inequality, yielding for instance for the first term:

$$ \log E_{y,z} P_{\theta}(x|y,z) \ge \frac{1}{2} E_{y \sim P_\theta(y|x)} \log P_\theta(x|y) + \frac{1}{2} E_{z \sim P_\theta(z|x)} \log P_\theta(x|z) + E_{(y,z) \sim Y,Z} \log P(y,z) $$

It is interesting to view the first two terms as reconstruction terms after back-translation.

As all these terms are expectations, one can try to use EM to maximise this bound: during the E-step one must compute the posterior of \(y|x\), which is approximated by its single \(\operatorname{argmax}\) \(\widehat{y}\) (itself computed approximately); during the M-step one should optimise over \(\theta\), but the authors instead propose to perform a single gradient update.
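A minimal sketch of one such update for a single pair \((x,y)\), assuming hypothetical seq2seq wrappers `model_xy` (for \(P_\theta(y|x)\)) and `model_yx` (for \(P_\theta(x|y)\)) with `generate` and `log_prob` methods; this illustrates the EM-flavoured procedure, not the authors' code:

```python
import torch

def em_style_step(model_xy, model_yx, batch_x, optimizer):
    """One EM-flavoured update for a single language pair (x, y).

    E-step (approximate): back-translate x into a single hypothesis y_hat,
    used as a point estimate of the latent translation.
    M-step (approximate): a single gradient step on the reconstruction
    term -log P_theta(x | y_hat) instead of a full inner optimisation.

    `model_xy.generate` and `model_yx.log_prob` are hypothetical
    interfaces used for illustration, not the authors' actual code.
    """
    # E-step: argmax decoding of the latent translation (no gradients).
    with torch.no_grad():
        y_hat = model_xy.generate(batch_x)          # x -> y_hat

    # M-step: one gradient update on the reconstruction log-likelihood.
    loss = -model_yx.log_prob(target=batch_x, source=y_hat).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```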

The Linformer

The Linformer approach of Wang et al. (2020) rests on the observation that the attention matrix computed by each head is approximately low rank, and can therefore be approximated by the product of two low-rank matrices.

Furthermore, these low-rank factors can be obtained by introducing two projection matrices for each head (motivated by random projections, and learned in practice): one projects the \(V^{kl} I^{l-1}\) term (of size \(T \times{} d_k\)) into an \((S \times{} d_k)\) matrix (through multiplication by an \(S \times T\) matrix), the other projects \(K^{kl} I^{l-1}\) likewise into an \((S \times d_k)\) matrix. As a result, the term inside the \(\operatorname{softmax}\) is a \(T\times S\) matrix (instead of \(T \times T\)). By choosing \(S \ll T\), we get the announced complexity reduction, at almost zero cost in terms of performance. As in other papers, the authors show that parameter sharing (here, sharing the projection matrices across layers) is also possible without harming performance (too much).
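As an illustration, here is a minimal single-head sketch in PyTorch (not the authors' implementation); `seq_len` plays the role of \(T\), `proj_len` that of \(S\), and the \(S \times T\) projections are plain learned parameters:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinformerHead(nn.Module):
    """Single attention head with Linformer-style low-rank projections.

    Standard attention builds a (T x T) matrix inside the softmax;
    here K and V are first projected along the sequence dimension
    from length T down to S, so the softmax term is only (T x S).
    """

    def __init__(self, d_model: int, d_k: int, seq_len: int, proj_len: int):
        super().__init__()
        self.d_k = d_k
        self.q = nn.Linear(d_model, d_k, bias=False)
        self.k = nn.Linear(d_model, d_k, bias=False)
        self.v = nn.Linear(d_model, d_k, bias=False)
        # (S x T) projections applied to K and V along the sequence axis.
        self.e_proj = nn.Parameter(torch.randn(proj_len, seq_len) / seq_len ** 0.5)
        self.f_proj = nn.Parameter(torch.randn(proj_len, seq_len) / seq_len ** 0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, d_model)
        q = self.q(x)                      # (batch, T, d_k)
        k = self.e_proj @ self.k(x)        # (batch, S, d_k)
        v = self.f_proj @ self.v(x)        # (batch, S, d_k)
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5   # (batch, T, S)
        return F.softmax(scores, dim=-1) @ v                 # (batch, T, d_k)


# Usage: with T=1024 and S=128 the softmax term is 1024 x 128 instead of 1024 x 1024.
head = LinformerHead(d_model=512, d_k=64, seq_len=1024, proj_len=128)
out = head(torch.randn(2, 1024, 512))   # -> (2, 1024, 64)
```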