Resources for Machine Translation

Corpora

French-English reference and post-edited translations for scholarly documents

This collection comprises several datasets for training and evaluating Machine Translation systems dedicated to scholarly documents in various domains. The datasets were prepared in the context of the MaTOS (Machine Translation for Open Science) project and are accessible from the project web site.

Tracking Gender Bias in French-English Machine Translation

This dataset contains about 3400 gendered occupation nouns (in French) and their translations into English. It has been used in several studies to demonstrate and analyze gender bias in Machine Translation. It was developed in the course of the NeuroViz project, a collaboration led by G. Wisniewski, N. Ballier and L. Zhu (U. Paris-Cité).

The corpus and all the associated documentation and references are available on the project web page.

Bio-Medical abstracts with Explicit Document Structure

This corpus contains refactored versions of the French-English test sets used in the Bio-Medical translation shared task for the years 2016 to 2020. The documents were downloaded from the main data source, then realigned at the document and sentence levels to provide a format better suited to MT evaluations, which typically expect sentence-aligned source-target pairs. The document-level test sets are aligned at the 'sub-heading' level: one empty line marks a sub-heading boundary, and two empty lines mark a document boundary.
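
The boundary convention described above can be read back with a few lines of Python. This is only a minimal sketch: it assumes one sentence per non-empty line and plain-text input, neither of which is confirmed here, and the function name parse_documents is hypothetical rather than part of the corpus tooling.

```python
from typing import List


def parse_documents(text: str) -> List[List[List[str]]]:
    """Split text into documents -> sub-heading blocks -> sentences.

    Assumed layout (hypothetical, check against the actual files):
    one sentence per line, one empty line between sub-heading blocks,
    two (or more) empty lines between documents.
    """
    documents: List[List[List[str]]] = []
    blocks: List[List[str]] = []  # blocks of the current document
    sentences: List[str] = []     # sentences of the current block
    empty_run = 0                 # consecutive empty lines seen so far

    for line in text.splitlines():
        if line.strip() == "":
            empty_run += 1
            continue
        # Any run of empty lines closes the current sub-heading block.
        if empty_run >= 1 and sentences:
            blocks.append(sentences)
            sentences = []
        # A run of two or more empty lines also closes the document.
        if empty_run >= 2 and blocks:
            documents.append(blocks)
            blocks = []
        empty_run = 0
        sentences.append(line.strip())

    # Flush whatever remains at the end of the input.
    if sentences:
        blocks.append(sentences)
    if blocks:
        documents.append(blocks)
    return documents


if __name__ == "__main__":
    sample = "Title.\nFirst sentence.\n\nMethods sentence.\n\n\nSecond title.\n"
    for doc_id, doc in enumerate(parse_documents(sample)):
        print(doc_id, doc)
```

If the source and target files follow the same layout, applying the function to both sides should yield structures of identical shape, which gives a quick sanity check on the alignment.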

The training data include 6 bio-medical corpora tagged with section information, as detailed in the following article:

Sadaf Abdul Rauf and François Yvon, "Translating scientific abstracts in the bio-medical domain with structure-aware models", Computer Speech & Language, Volume 87, 2024, ISSN 0885-2308, https://doi.org/10.1016/j.csl.2024.101623.

You can get this corpus on GitHub.

English-French translation of Cochrane's systematic reviews

This corpus contains French translations of abstracts of systematic reviews published by Cochrane. It has three subparts: the largest consists of translations performed by human translators; the two smaller ones were produced via post-edition. One of these contains post-editions of the output of Google's (S)MT engine of the time, the other post-editions of the output of our in-house biomedical SMT engine.

This corpus is fully documented in this paper: Diagnosing High-Quality Statistical Machine Translation Using Traces of Post-Edition Operations (Julia Ive, Aurélien Max, François Yvon, Philippe Ravaud), In International Conference on Language Resources and Evaluation - Workshop on Translation Evaluation: From Fragmented Tools and Data Sets to an Integrated Ecosystem (MT Eval 2016), 2016.

You can get this corpus on GitHub.

Word and sentence alignments

Resources for Sentence Alignments

This corpus was collected during the ANR/TransRead project. It contains:
* A collection of gold sentence alignments for 13 novels (mostly en/fr)
* A collection of confidence annotations of sentence alignment links
* A collection of confidence annotations of word alignment links in 5 language pairs.

You can get this corpus on GitHub.

Hierarchical Word Alignments

This corpus contains manual word alignments for 5 short stories in French, each aligned with one corresponding English translation. What distinguishes this data from other sources of word-aligned parallel data is that the alignment is hierarchical: we have also collected alignments for segments of variable length, annotated from the top level (the sentence) down to the bottom level (the word).

You can get this corpus on GitHub.

References

Yong Xu and François Yvon. 2016. Novel elicitation and annotation schemes for sentential and sub-sentential alignments of bitexts. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 628–635, Portorož, Slovenia. European Language Resources Association (ELRA).

Trace corpus of translation errors

This corpus was developed during the French ANR/Trace project. It contains almost 7,000 French-to-English and 7,000 English-to-French translations and their post-editions by professional translators. Download.

This corpus is described in this paper: Design and Analysis of a Large Corpus of Post-Edited Translations: Quality Estimation, Failure Analysis and the Variability of Post-Edition (Guillaume Wisniewski, Anil Kumar Singh, Natalia Segal, François Yvon), In Machine Translation Summit (MT Summit), 2013.

Demos

Writing in two languages

This is a demo of a prototype system for writing in more than one language. It looks just like a regular translation interface, except that both boxes are active.

More about this work in the associated publication: BiSync: A Bilingual Editor for Synchronized Monolingual Texts (Crego et al., ACL 2023)

Bilingual reading in the TransRead project

This is a demo of a system for reading texts in two languages. Think of it as a digital version of a bilingual book.

More about this work in the associated publication: TransRead: Designing a Bilingual Reading Experience with Machine Translation Technologies (Yvon et al., NAACL 2016)