On-demand Knowledge for Document-level Machine Translation

Statistical and neural machine translation systems (SMT and NMT, for short) have reached significant levels of quality and speed. Such systems use large amounts of monolingual and bilingual data to train their models and tune their meta-parameters. As a result, a sentence in a language unknown to the user can be translated with acceptable quality, and hence with clearly perceived utility. However, the translation of complete texts is still far from publishable and requires substantial post-editing by humans. One reason for this difference is that certain linguistic constraints cannot be reliably translated using only local information, especially when they apply across different sentences. The homogeneous models used by SMT or NMT systems (very large translation tables or connection weights) are a strength for robust and quick sentence translation, but impose a strong limitation when constraints of different ranges must be taken into account to translate a document. In recent years, I have pioneered methods to address document-level problems that degrade MT quality, drawing on specific linguistic knowledge as required by each problem. These methods improved the translation of discourse connectives or verb tenses based on sentence-level or document-level semantic features, or constrained the choice of referring expressions such as pronouns and noun phrases. For implementation, the methods took advantage of existing approaches, such as factored models, to integrate linguistic knowledge with SMT. However, integrating several solutions for document-level constraints into a unified system is not tractable with current approaches, because knowledge sources are heterogeneous and are not tightly coupled with the MT systems. Moreover, the quality improvements brought by several distinct knowledge sources may not add up, because of interactions between them. Finally, the need to compute all features for all words or sentences of a document raises serious efficiency issues.
In the DOMAT project, we aim to design a novel approach for providing on-demand linguistic knowledge to statistical or neural MT systems. Both types of systems will be considered, to provide terms of comparison, as each has strengths and weaknesses. The linguistic knowledge will be learned by specific processing modules, which will extract and output features in a format that is usable by SMT and NMT systems. To make this architecture operational, we will explore strategies to trigger the modules, for instance based on quality estimation or translation confidence. To populate the architecture, we will build several modules to extract document-level features that are relevant to translation, principally document structure (discourse relations) and coreference (including pronominal anaphora). The starting points for these modules will be our previous achievements in document-level SMT. The DOMAT project will mainly support two PhD theses at Idiap/EPFL, one on designing and comparing statistical and neural architectures for integrating and triggering on-demand knowledge sources in MT, and the other on designing such knowledge sources, which learn specific text-level constraints and output suitable data structures for NMT. The solutions developed in DOMAT will make tractable the demands for adequate, fluent and efficient translation of large documents, and will result in a principled approach for learning high-level linguistic knowledge to improve translation quality.
Idiap Research Institute
HES-SO Master - Vaud
Swiss National Science Foundation
Oct 01, 2018
Sep 30, 2022