Outline of research in COMTIS
Objective
The goal of COMTIS is to extend the current statistical machine translation paradigm by modeling intersentential relations. COMTIS involves researchers in human language technology, machine learning, linguistics, and system evaluation, coming from three different groups with extensive contributions to the relevant fields.
Background
Machine translation (MT) has made significant progress in the past decade, but its focus has remained on the translation of sentences considered individually. However, in order to ensure overall coherence throughout a translated text, an MT system must also consider and correctly render the items that depend on intersentential relations. The perceived coherence of a translated text, and therefore its overall quality, is mainly influenced by the following markers: pronouns, verb tense/mode/aspect, discourse connectives, and politeness/style/register. None of these markers can be reliably translated on a purely sentence-by-sentence basis.
Research questions
In COMTIS, linguistic theory and corpus studies will provide the ground for a detailed study of a number of cohesion markers. Methods from corpus linguistics will be used to assess which cohesion markers have the most impact on the perceived coherence of a translated text. This will provide information about their most suitable representation, the most robust features for automatic identification, as well as their translation (English/French). Monolingual and parallel corpora will be prepared, to be used as training data or as test suites.
Automatic labeling modules will identify intersentential relations, using surface features and labels inspired by the linguistic studies of cohesion markers, as well as features obtained from joint syntactic parsing and semantic role analysis. New SMT models will be developed and trained over parallel corpora enriched with the labels defined above, based on state-of-the-art phrase-based SMT models extended to exploit intersentential relations. Metrics that assess the improvement in the coherence of MT output will be designed in a principled way. The performance of past systems and of those resulting from COMTIS will be assessed using the new metrics and current sentence-level ones.
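As a rough illustration of what enriching a corpus with labels could look like, the sketch below attaches a sense label to discourse connectives in a tokenized English sentence. Everything in it is an assumption made for illustration and is not taken from COMTIS: the tiny connective lexicon, the sense names, the single surface-feature rule (standing in for a trained classifier), and the `token|label` output convention.

```python
import re

# Hypothetical lexicon: connectives and their candidate senses.
# Illustrative only; a real inventory would come from the linguistic studies.
CONNECTIVE_SENSES = {
    "since": ("temporal", "causal"),
    "while": ("temporal", "contrast"),
    "however": ("contrast",),
}


def guess_sense(connective, position, tokens):
    """Pick a sense from surface features (toy rule, not a trained model)."""
    senses = CONNECTIVE_SENSES[connective.lower()]
    if len(senses) == 1:
        return senses[0]
    # Assumed heuristic: "since" followed by a token containing a digit
    # (e.g. a year) is read as temporal.
    next_token = tokens[position + 1] if position + 1 < len(tokens) else ""
    if connective.lower() == "since" and re.search(r"\d", next_token):
        return "temporal"
    # Otherwise fall back to the last listed sense.
    return senses[-1]


def label_connectives(sentence):
    """Return the sentence with known connectives rewritten as token|label."""
    tokens = sentence.split()
    labeled = []
    for i, token in enumerate(tokens):
        if token.lower() in CONNECTIVE_SENSES:
            labeled.append(f"{token}|{guess_sense(token, i, tokens)}")
        else:
            labeled.append(token)
    return " ".join(labeled)


if __name__ == "__main__":
    print(label_connectives("He has lived here since 1990 ."))
    print(label_connectives("Since it rained , we stayed home ."))
```

Source sentences labeled in this way could then be paired with their translations to train a phrase-based model that sees the disambiguated connective rather than the bare word. How labels are actually represented and consumed by the SMT models is a design decision left open here.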

