Modeling discourse entities and relations for coherent machine translation

State-of-the-art machine translation (MT) systems, especially statistical but also rule-based ones, operate sentence by sentence and do not propagate information across the sentences that make up a text. Such propagation is, however, helpful, and sometimes even indispensable, for making correct translation choices for words and phrases that depend on earlier ones. The goal of MODERN is to model and automatically detect such dependencies, and to study their integration within MT, with the aim of demonstrating improvement in translation quality. The focus of MODERN is on the interplay between referring expressions such as noun phrases and pronouns, which must be translated coherently throughout a text, and discourse relations between sentences, which are often conveyed by explicit connectives that are notoriously difficult to translate. MODERN will study joint computational models of discourse entities and discourse relations in texts, based on linguistic theories and experimental grounding, and the inclusion in such models of automatically generated domain knowledge related to the discourse entities. MODERN will design and implement these probabilistic models, and integrate them with operational MT systems, both rule-based (Apertium) and statistical (Moses). Particular attention will be paid to the evaluation of MT improvement, by studying the effect on human readers of various translation options for discourse entities and connectives, and by aiming to optimize MT output in this respect. The MODERN project will focus on four languages -- English, French, German and Dutch -- for which the partners have considerable expertise. Two domains will be used as case studies: Alpine texts from a multilingual corpus of Alpine Club yearbooks (Text+Berg), and texts on environmental legislation and debates extracted from the JRC-Acquis, DGT-Acquis, and Europarl parallel corpora.
The MODERN project partially builds upon the COMTIS Sinergia project (2010-2013), which mainly investigated the automatic translation of discourse connectives, using annotated corpora to train a connective disambiguation system and a connective-aware statistical MT system from English to French, Arabic, German and Italian. The improvement was demonstrated using new connective-specific metrics, and the method has also been tested on pronouns and verbs.
Work in the MODERN project will be organized around four sub-projects, labeled A through D, based respectively at each of the participating institutions: UiL/OTS, UniGe, Idiap, UniZh. These sub-projects are: (A) empirically based linguistic modeling and psycholinguistic evaluation, focusing on the interface between referential and relational cohesion (3-year PhD), and performing a cognitive evaluation of human and MT output (2-year postdoc); (B) cross-linguistic modeling of EN/FR past tense (4th-year PhD); (C) automatic multilingual extraction of discourse entities and relations, with four topics: automatic classification of inter-sentential dependencies for SMT (4th-year PhD), joint probabilistic modeling of discourse entities and relations (DE&R, 2-year PhD), SMT decoder enhanced by a DE&R model (2-year postdoc), and evaluation methods for MT coherence (4-month finishing postdoc); (D) integrating domain-specific semantics with caching for text-level MT, focusing on automatically built ontologies for MT (2-year postdoc) and on implementing and evaluating caching for rule-based and statistical MT (3-year PhD).
Application Area - Human Machine Interaction, Machine Learning
Idiap Research Institute
University of Geneva, University of Zurich, University of Utrecht
Swiss National Science Foundation
Aug 01, 2013
Jul 31, 2017