Reading Group - GDR TAL

Posted on Fri, 01 Jan 2021 in misc

October 30, 2020

A multilingual view of unsupervised MT

Garcia, Foret, Sellam and Parikh

The authors devise a model for multilingual unsupervised MT. The main idea is as follows: consider \(L\) languages and model the joint distribution \(P(x,y,z,\dots)\) (let us assume \(L=3\) for the sake of the argument) based on a collection of monolingual or bilingual corpora. The translation parameters require conditional models, so the main objective is a sum

$$ \log E_{y,z} P_{\theta}(x|y,z) + \log E_{x,z} P_{\theta}(y|x,z) + \log E_{x,y} P_{\theta}(z|x,y) $$

where the unobserved source sentences are handled as latent variables in the model. A major assumption is that we do not need both \(y\) and \(z\) to generate the translation \(x\); either one suffices, hence \(P_{\theta}(x|y,z) = P_{\theta}(x|y) = P_{\theta}(x|z) = \sqrt{P_{\theta}(x|y)P_{\theta}(x|z)}\).
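Taking logs, this agreement assumption makes the conditional log-likelihood split evenly between the two possible sources, which is where the \(\frac{1}{2}\) factors in the bound below come from:

$$ \log P_{\theta}(x|y,z) = \frac{1}{2} \log P_{\theta}(x|y) + \frac{1}{2} \log P_{\theta}(x|z). $$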

Each term in the sum is lower bounded using Jensen's inequality, yielding for instance for the first term:

$$ \log E_{y,z} P_{\theta}(x|y,z) \ge \frac{1}{2} E_{y \sim P_\theta(y|x)} \log P_{\theta}(x|y) + \frac{1}{2} E_{z \sim P_\theta(z|x)} \log P_{\theta}(x|z) + E_{(y,z)} \log P(y,z) $$
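One way to recover these terms (a standard ELBO-style argument, assuming the proposal is the model's own translation distribution): for any proposal \(q(y,z)\),

$$ \log E_{(y,z)} P_{\theta}(x|y,z) = \log E_{(y,z) \sim q}\left[ P_{\theta}(x|y,z) \frac{P(y,z)}{q(y,z)} \right] \ge E_{(y,z) \sim q}\left[ \log P_{\theta}(x|y,z) \right] + E_{(y,z) \sim q}\left[ \log \frac{P(y,z)}{q(y,z)} \right], $$

and choosing \(q(y,z) = P_\theta(y|x)\,P_\theta(z|x)\), together with the agreement assumption above, turns the first expectation into the two reconstruction terms and the second into the \(\log P(y,z)\) prior term (up to the entropy of \(q\)).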

It is interesting to see the first two terms as reconstruction terms after back-translation: \(x\) is translated into \(y\) (resp. \(z\)) with the current model, which is then trained to reconstruct \(x\) from its own translation.
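A minimal sketch of how a one-sample (argmax) estimate of these two reconstruction terms could be computed; `translate` and `log_prob` are hypothetical stand-ins for a real multilingual NMT model, not the authors' code.

```python
from typing import Callable, Sequence


def back_translation_terms(
    x: Sequence[str],
    translate: Callable[[Sequence[str], str, str], Sequence[str]],
    log_prob: Callable[[Sequence[str], Sequence[str], str, str], float],
) -> float:
    """One-sample (argmax) estimate of the first two terms of the bound
    for a sentence x in language X, with helper languages Y and Z."""
    y_hat = translate(x, "X", "Y")  # approx. argmax_y P_theta(y | x)
    z_hat = translate(x, "X", "Z")  # approx. argmax_z P_theta(z | x)
    # 1/2 log P_theta(x | y_hat) + 1/2 log P_theta(x | z_hat):
    # reconstruct x from its own (back-)translations.
    return 0.5 * log_prob(x, y_hat, "Y", "X") + 0.5 * log_prob(x, z_hat, "Z", "X")


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs: identity "translation", constant log-prob.
    demo = back_translation_terms(
        ["a toy sentence"],
        translate=lambda s, src, tgt: s,
        log_prob=lambda x, y, src, tgt: -1.0,
    )
    print(demo)  # -1.0
```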

As all these terms are expectations, one can try to use EM to maximise this bound: during the E-step one must compute the posterior of \(y|x\), which is approximated by its (itself approximate) \(\mathrm{argmax}\) \(\widehat{y}\); during the M-step one should optimise over \(\theta\), but the authors instead propose to perform a single gradient update step.
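To make the procedure concrete, here is a minimal sketch of one such EM-like update in PyTorch, assuming a multilingual seq2seq wrapper with hypothetical `generate` (argmax decoding) and `log_prob` (differentiable conditional log-likelihood) methods; this is an illustration under those assumptions, not the authors' implementation.

```python
import torch


def em_like_step(model, optimizer, x, src="X", helper_langs=("Y", "Z")):
    # "E-step": approximate the posterior over translations by its argmax,
    # computed with the current parameters and treated as a constant.
    with torch.no_grad():
        pseudo_sources = [model.generate(x, src, lang) for lang in helper_langs]

    # "M-step": instead of fully maximising over theta, take a single
    # gradient step on the (negative) reconstruction part of the bound.
    loss = -sum(
        model.log_prob(x, y_hat, lang, src)
        for y_hat, lang in zip(pseudo_sources, helper_langs)
    ) / len(helper_langs)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```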