


Pronoun translation poses a problem for MT systems as pronoun systems do not map well across languages, e.g., due to differences in gender, number, case, formality, or humanness, and to differences in where pronouns may be used. Translation divergences typically lead to mistakes in MT output, as when translating the English "it" into French ("il", "elle", or "cela"?) or into German ("er", "sie", or "es"?). What about the translation of null subjects? Null subject languages express person and number within the verb's morphology, rendering a subject pronoun or noun phrase redundant.  How should these instances be translated? One way to model pronoun translation is to treat it as a cross-lingual pronoun prediction task.

We propose such a task, which asks participants to predict a target-language pronoun given a source-language pronoun in the context of a sentence. We will further provide a lemmatised target-language human-authored translation of the source sentence, and automatic word alignments between the source sentence words and the target-language lemmata. In the translation, the words aligned to a subset of the source-language third-person subject pronouns are substituted by placeholders. The aim of the task is to predict, for each placeholder, the word that should replace it from a small, closed set of classes, using any type of information that can be extracted from the documents.

The cross-lingual pronoun prediction task will be similar to the PRONOUN translation task held last year at WMT:

Participants are invited to submit systems for the English-to-French, English-to-German, German-to-English and Spanish-to-English language pairs.



In the cross-lingual pronoun prediction task, you are given a source-language document with a lemmatised and POS-tagged human-authored translation and a set of word alignments between the two languages. In the translation, the lemmatised tokens aligned to the source-language third-person pronouns are substituted by placeholders. Your task is to predict, for each placeholder, the fully inflected word token that should replace the placeholder from a small, closed set of classes; that is, to provide the fully inflected target-language translation of the source pronoun in the context sketched by the lemmatised and tagged target side. You may use any type of information that you can extract from the documents.

Lemmatised and POS-tagged target-language data is provided in place of fully inflected text. The provision of lemmatised data is intended both to provide a challenging task, and to simulate a scenario that is more closely aligned with working with machine translation system output. POS tags provide additional information which may be useful in the disambiguation of lemmas (e.g. noun vs. verb, etc.) and in the detection of patterns of pronoun use.

The pronoun prediction task will be run for the following sub-tasks:

  • English-to-German
  • German-to-English
  • English-to-French
  • Spanish-to-English **** New ****

Details of the source-language pronouns and the prediction classes that exist for each of the above sub-tasks are provided in the sections below. The different combinations of source-language pronoun and target-language prediction classes represent some of the different problems that SMT systems face when translating pronouns for a given language pair and translation direction.

The task will be evaluated automatically by matching the predictions against the words found in the reference translation by computing the overall accuracy and precision, recall and F-score for each class. The primary score for the evaluation is the macro-averaged F-score over all classes. Compared to accuracy, the macro-averaged F-score favours systems that consistently perform well on all classes and penalises systems that maximise the performance on frequent classes while sacrificing infrequent ones.
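To illustrate why the macro-averaged F-score is preferred over accuracy, here is a minimal sketch of the metric (illustrative only, not the official scorer), together with a toy majority-class baseline that looks good on accuracy but poorly on macro-F1:

```python
def macro_f1(gold, pred):
    """Per-class F1, averaged with equal weight for every class."""
    classes = sorted(set(gold) | set(pred))
    f1s = []
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Always predicting the frequent class looks good on accuracy ...
gold = ["il"] * 8 + ["elle"] * 2
pred = ["il"] * 10
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)  # 0.8
# ... but the macro-averaged F-score exposes the sacrificed class:
score = macro_f1(gold, pred)  # ≈ 0.44, since F1 for "elle" is 0
```

A system that spread its errors evenly over both classes would score lower on accuracy here but higher on macro-F1, which is exactly the behaviour the metric is meant to reward.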

The data supplied for the classification task consists of parallel source-target text with word alignments. In the target-language text, a subset of the words aligned to source-language occurrences of a specified set of pronouns have been replaced by placeholders of the form REPLACE_xx, where xx is the index of the source-language word the placeholder is aligned to. Your task is to predict one of the classes listed in the relevant source-target section below, for each occurrence of a placeholder.

The training and development data is supplied in a file format with five tab-separated columns:

  • the class label
  • the word actually removed from the text (may be different from the class label for class OTHER and in some edge cases)
  • the source-language segment
  • the target-language segment with pronoun placeholders
  • the word alignment (a space-separated list of word alignments of the form SOURCE-TARGET, word indices are zero-based)


A single segment may contain more than one placeholder. In that case, columns 1 and 2 contain multiple space-separated entries in the order of placeholder occurrence. A document segmentation of the data is provided in separate files for each corpus. These files contain one line per segment, but the precise format varies depending on the type of document markup available for the different corpora. In the development and test data, the files have a single column containing the ID of the document the segment is part of.

Here is an example line from one of the training data files:

elles	Elles	They arrive first .	REPLACE_0 arriver|VER en|PRP premier|NUM .|.	0-0 1-1 2-2 2-3 3-4

The test set will be supplied in the same format, but with columns 1 and 2 empty, so that each line starts with two tab characters. Your submission should have the same format as column 1 above, so a correct solution would contain the class label elles in this case. Each line should contain as many space-separated class labels as there are REPLACE tags in the corresponding segment. For each segment not containing any REPLACE tags, an empty line should be emitted. Additional tab-separated columns may be present in the submission, but will be ignored. Note in particular that you are not required to predict the second column. The submitted files should be encoded in UTF-8 (like the data we provide).
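The format above can be parsed with a few lines of Python. The following is an illustrative sketch (the helper and variable names are our own, not part of the official tools), applied to the training example shown earlier:

```python
def parse_line(line):
    """Split a training line into its five tab-separated fields."""
    labels, removed, source, target, align = line.rstrip("\n").split("\t")
    source_tokens = source.split()
    # Each REPLACE_xx placeholder names the zero-based index of the
    # source-language word it is aligned to.
    pointed = [source_tokens[int(tok.split("_")[1])]
               for tok in target.split() if tok.startswith("REPLACE_")]
    alignment = [tuple(map(int, pair.split("-"))) for pair in align.split()]
    return {"labels": labels.split(), "removed": removed.split(),
            "source_words": pointed, "alignment": alignment}

line = ("elles\tElles\tThey arrive first .\t"
        "REPLACE_0 arriver|VER en|PRP premier|NUM .|.\t0-0 1-1 2-2 2-3 3-4")
example = parse_line(line)
# example["labels"] == ["elles"]; example["source_words"] == ["They"]
```

A submission line is then simply the space-separated predicted labels, one per placeholder, in order of occurrence (an empty line for segments without placeholders).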

The training, development and test datasets will be filtered to remove non-subject position pronouns. The filtering for the development and test set will be manually checked to ensure that no non-subject position pronouns remain. For more information, please see the section on data filtering below. 


The following sections describe the set of source-language pronouns and target-language classes to be predicted, for each of the four sub-tasks. The selection of the source-language pronouns and their target-language prediction classes for each sub-task is based on the variation that is possible when translating a given source-language pronoun. For example, when translating the English pronoun "it" into French, a decision must be made as to the gender of the French pronoun, with "il" and "elle" both providing valid options. 

This year, we have decided to include the sub-task of translation from Spanish into English. This pair involves the additional difficulty of generating overt English pronouns for Spanish null subjects. The training data follows an identical format to that of the other language pairs. The difference is that the placeholder points to the (zero-based) position of a third-person Spanish verb with no overt subject. In the following example, the placeholder REPLACE_3 points to the Spanish verb "es" in the third (zero-based) position. The class to predict is the English pronoun "it".

it    it     de hecho , es posible que marte haya sido habitable en el pasado.    indeed|ADV ,|PUNCT REPLACE_3 be|VERB possible|ADJ that|SCONJ Mars|PROPN be|VERB habitable|ADJ in|ADP the|DET past|NOUN.    0-0 1-0 2-1 3-2 4-4 5-5 6-6 7-7 8-7 9-8 10-9 11-10 12-11

You should *always* predict either a word token or "OTHER". See the prediction class lists below for the word tokens to predict in each sub-task.



This sub-task will concentrate on the translation of subject position "it" and "they" from English into French. The following prediction classes exist for this sub-task:

ce The French pronoun ce (sometimes with elided vowel as c') as in the expression c'est "it is"
elle Feminine singular subject pronoun
elles Feminine plural subject pronoun
il Masculine singular subject pronoun
ils Masculine plural subject pronoun
cela Demonstrative pronouns. Includes "cela", "ça", the misspelling "ca", and the rare elided form "ç' "
on Indefinite pronoun
OTHER Some other word, or nothing at all, should be inserted


This sub-task will concentrate on the translation of third person Spanish verbs without an overt subject. The following prediction classes exist for this sub-task:

he Masculine singular subject pronoun
she Feminine singular subject pronoun
you Second person pronoun (with both generic and deictic uses)
it Non-gendered singular subject pronoun
they Non-gendered plural subject pronoun
there Existential "there"
OTHER Some other word, or nothing at all, should be inserted


This sub-task will concentrate on the translation of subject position "it" and "they" from English into German. The following prediction classes exist for this sub-task:

er Masculine singular subject pronoun
sie Feminine singular subject pronoun
es Neuter singular subject pronoun
man Indefinite pronoun
OTHER Some other word, or nothing at all, should be inserted


This sub-task will concentrate on the translation of subject position "er", "sie" and "es" from German into English. The following prediction classes exist for this sub-task:

he Masculine singular subject pronoun
she Feminine singular subject pronoun
it Non-gendered singular subject pronoun
they Non-gendered plural subject pronoun
you Second person pronoun (with both generic and deictic uses)
this Demonstrative pronouns (singular). Includes both "this" and "that"
these Demonstrative pronouns (plural). Includes both "these" and "those"
there Existential "there"
OTHER Some other word, or nothing at all, should be inserted

Discussion Group

If you are interested in participating in the shared task, we recommend that you sign up to our discussion group to make sure you don't miss any important information. Feel free to ask any questions you may have about the shared task! (Google group: discomt2017-cross-lingual-pronoun-prediction-shared-task)


Please note that we have postponed the dates for the release of the test data and for the system submission.




The task is to predict the translation of subject position pronouns for all sub-tasks. In order to ensure fair and accurate evaluation of system performance, filtering will be applied to the source-language texts of the training, development and test datasets, to select only those pronoun instances that are relevant to each sub-task. For example, in the case of English-to-French translation, which focuses on the translation of the English subject position pronouns "it" and "they", only subject position instances of these pronouns will be included in the development and test datasets.



The training and development datasets can be downloaded from the following locations:

Classification data files: *.data.gz

Document ids (the document to which each sentence belongs): *.doc-ids.gz

Please note that:

  • For the Spanish-to-English task, there are currently two development sets. The first is the same as for the other language pairs; however, it contains few or no instances of the low-frequency classes. The second has been manually curated to ensure that several instances of all the classes are included.
****A problem concerning two sentences has been corrected in the file: two lines had REPLACE placeholders but no class label. If you downloaded this data set before April 19th, 2017, you may be affected. A corrected version is now available.****
  • In the current release, manual filtering is applied to both the development and test sets of all language pairs. The remaining training data (IWSLT15, Europarl and NCv9) has not been filtered. Automatically filtered versions of all the training data for English-to-French, English-to-German and German-to-English can be downloaded from last year's repository.
  • For the Spanish-to-English task, filtered training data is not provided. This is because in Spanish only the subject argument can be left unexpressed; there is therefore no risk of selecting other arguments, such as objects.
  • ALERT: It has been pointed out to us that we had provided English-to-Spanish alignments for the Spanish-to-English training files. We have now provided a version of the files with Spanish-to-English alignments. The development and test files are not affected by this issue.



Gold test sets and official scorer now available!





We provide baseline language models for each sub-task.


German-to-English and Spanish-to-English:


To use the baselines you will need to:

  • Download the python script and YAML files from here
  • Download the relevant KenLM binary format file (mono+para.5.xx.lemma.trie.kenlm)
  • Install KenLM
  • Install the KenLM python module


Installing the KenLM python module: if you have pip installed, run pip install kenlm. Alternatively, after downloading KenLM, run python setup.py install.

You can get predictions for the baseline model by running, e.g.:

python discomt_baseline.py --fmt=replace --removepos --conf en-fr.yml

These are in the format that the scorer requires, with predictions in the first column and the predicted word in the second column (the scorer always ignores the second column, so don't worry if your system doesn't predict words), etc.

If you're interested in just using the marginal probabilities for each filler from the language model, you can also use:

python discomt_baseline.py --fmt=scores --removepos --conf en-fr.yml

which will give you, for each input line, one line with TEXT in the second column giving the source/target text, and zero or more lines with ITEM 0, ITEM 1, etc., giving a (partial) probability distribution over the fillers for each REPLACE position.

Other flags:

  • --lm LANGUAGE_MODEL: use another language model (otherwise it assumes the default name and the current directory)
  • --null-penalty PENALTY: use this penalty for predicting no filler at all (which counts as OTHER)
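The core of the baseline can be sketched as follows. This is an illustrative reimplementation, not the official script: score_fn stands in for a KenLM model's score method, FILLERS is the English-to-French class list, and null_penalty plays the role of --null-penalty.

```python
FILLERS = ["ce", "elle", "elles", "il", "ils", "cela", "on"]  # en-fr classes

def predict(target_tokens, score_fn, fillers=FILLERS, null_penalty=-4.0):
    """Return one predicted class per REPLACE_xx placeholder.

    For each placeholder, substitute every candidate filler into the
    target segment and keep the one the language model scores highest;
    dropping the filler entirely (class OTHER) pays null_penalty.
    """
    predictions = []
    for i, tok in enumerate(target_tokens):
        if not tok.startswith("REPLACE_"):
            continue
        without = target_tokens[:i] + target_tokens[i + 1:]
        best_label = "OTHER"
        best_score = score_fn(" ".join(without)) + null_penalty
        for filler in fillers:
            candidate = target_tokens[:i] + [filler] + target_tokens[i + 1:]
            score = score_fn(" ".join(candidate))
            if score > best_score:
                best_label, best_score = filler, score
        predictions.append(best_label)
    return predictions

# With the real baseline, score_fn would wrap a KenLM model, e.g.:
#   import kenlm
#   score_fn = kenlm.Model("mono+para.5.fr.lemma.trie.kenlm").score
```

Any callable mapping a token string to a log-probability can be plugged in as score_fn, which is convenient for testing the selection logic without a trained model.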


Example of evaluation output using the development data for Spanish-to-English:

<<< I.  EVALUATION >>>
Confusion matrix:
           he  she   it they  you there OTHER  <-- classified as
         +------------------------------------+ -SUM-
   he    |   1    0    0    0    0    0    2  |     3
   she   |   0    0    2    0    2    0    3  |     7
   it    |   2    1   43    2    1    8   20  |    77
   they  |   7    0    8    8    3    1   19  |    46
   you   |   3    0    2    1   11    0    4  |    21
   there |   1    1    8    0    0   17    1  |    28
   OTHER |   5    0   11    1    5    1   33  |    56
   -SUM-    19    2   74   12   22   27   82

Accuracy (calculated for the above confusion matrix) = 113/238 = 47.48%

Results for the individual labels:
       he  :    P =     1/   19 =   5.26%     R =     1/    3 =  33.33%     F1 =   9.09%
      she  :    P =     0/    2 =   0.00%     R =     0/    7 =   0.00%     F1 =   0.00%
       it  :    P =    43/   74 =  58.11%     R =    43/   77 =  55.84%     F1 =  56.95%
     they  :    P =     8/   12 =  66.67%     R =     8/   46 =  17.39%     F1 =  27.59%
      you  :    P =    11/   22 =  50.00%     R =    11/   21 =  52.38%     F1 =  51.16%
     there :    P =    17/   27 =  62.96%     R =    17/   28 =  60.71%     F1 =  61.82%
     OTHER :    P =    33/   82 =  40.24%     R =    33/   56 =  58.93%     F1 =  47.83%

Micro-averaged result:
P =   113/  238 =  47.48%     R =   113/  238 =  47.48%     F1 =  47.48%

MACRO-averaged result:
P =  40.46%    R =  39.80%    F1 =  36.35%


MACRO-averaged R:  39.80%



The predicted pronoun class labels will be automatically evaluated against the gold standard translations from the test set (see the example evaluation output for the classification baseline above). The current version of the scorer is available here. The script contains instructions detailing how it should be used.

The script computes macro-averaged R scores. The justification for computing macro-averaged R is that, unlike with macro-averaged F1, each error is counted only once. Because macro-averaged F1 averages over all prediction classes, the same error is counted twice: once for the precision of the predicted class and once for the recall of the gold class.
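To make these definitions concrete, the headline numbers from the Spanish-to-English example output can be recomputed directly from its confusion matrix (rows are gold labels, columns are predictions; the values below are copied from that example):

```python
labels = ["he", "she", "it", "they", "you", "there", "OTHER"]
matrix = [  # rows: gold label; columns: classified as
    [1, 0,  0, 0,  0,  0,  2],   # he
    [0, 0,  2, 0,  2,  0,  3],   # she
    [2, 1, 43, 2,  1,  8, 20],   # it
    [7, 0,  8, 8,  3,  1, 19],   # they
    [3, 0,  2, 1, 11,  0,  4],   # you
    [1, 1,  8, 0,  0, 17,  1],   # there
    [5, 0, 11, 1,  5,  1, 33],   # OTHER
]
total = sum(sum(row) for row in matrix)                   # 238
correct = sum(matrix[i][i] for i in range(len(labels)))   # 113 on the diagonal
accuracy = correct / total                                # 47.48%
recalls = [matrix[i][i] / sum(row) for i, row in enumerate(matrix)]
macro_r = sum(recalls) / len(recalls)                     # 39.80%
```

Note how the zero recall for "she" drags the macro average well below the accuracy, which is the intended effect of macro-averaging.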




We will provide the input data in the same format as the training data, but with the first two columns empty. Your predictions should be submitted in the format recognized by the official scorer; see above for details. Please e-mail the file with the predictions, labelled with the name of your system, to sharid.loaiciga [at] no later than June 9th, 2017 (any time zone). The following website provides the time Anywhere on Earth (AOE), which you may find useful.

Please note that each participant may submit up to two systems per task. If you submit more than one system for a given task, please indicate which system is the primary system.




We would like to invite all groups who participated in the shared task to submit a system description paper detailing the design of their systems. System description papers should be 4 pages in length (plus any number of additional pages for references) and should conform to the EMNLP official style guidelines. These guidelines are contained in the style files which can be downloaded from: . Final versions (camera-ready) will be given one more page of content to address the comments of the reviewers.

System description papers are subject to review. However, the scores or ranking achieved in the shared task evaluation have no bearing on the acceptance decision. We strongly recommend that you write a system description paper and present your system(s) at the workshop no matter how successful your approach was in the evaluation. 

Unlike regular long and short papers, system description papers need not be anonymized.

At minimum, a system paper should:

  • Contain a detailed description of the design of the system(s)
  • Describe the motivation behind the design of each system
  • Contain sufficient detail to make the results reproducible
  • Describe data sets and tools not included in the official data release

In addition, we strongly recommend that the system paper contains:

  • Any additional contrastive results that you feel are appropriate
  • A critical discussion of the system performance
  • Error analyses detailing where the system provided the results that you hoped for, where it didn't, and possible reasons why. This could be provided for the test set, or the development set.


System papers do not need to provide a detailed description of the task itself or the data sets provided by the organizers. Instead, they may refer to the shared task overview paper, the bibliographic details of which will be announced prior to the camera-ready deadline. 

System description papers should be submitted electronically via the START system:


Poster Format

A0 Landscape



The organisation of this task has received support from the following project:

  • Discourse-Oriented Statistical Machine Translation funded by the Swedish Research Council (2012-916)