Adaptive Multilingual Speech Processing

This project addresses, in a fully integrated research context, two important recent trends in spoken language technologies: (1) multilingual automatic speech recognition (ASR) and (2) multilingual speech synthesis and text-to-speech (TTS), with particular emphasis on the convergence of these two research areas towards unified speech modeling. Over the last decade, ASR and TTS technologies have both converged on statistical parametric approaches. Despite this apparent convergence, the ASR and TTS communities continue to conduct their research largely independently, with only occasional cross-overs between the two. We believe that properly addressing complex multilingual ASR and TTS tasks (including for low-resourced languages), with the goal of improving the robustness and quality of both speech recognition and speech synthesis systems, will require looking at those problems in an integrated way. Multilingual speech processing has been a topic of ongoing interest to the research community for many years, and the field is now receiving renewed interest owing to two strong driving forces. First, technical advances in speech recognition and synthesis are posing new challenges to researchers, but also new opportunities, which should be fully exploited here. Second, both government and industry are providing impetus for technologies that help break down domestic and international language barriers. The objective of the present project is thus to investigate multiple related facets of the multilingual ASR and TTS problems, focusing mainly on the key aspects of cross-language and speaker adaptation, and primarily on those approaches that aim at reducing the gap (and/or closing the perceptual/communication loop) between speech recognition and speech synthesis.
We seek funding for three PhD positions over three years and one postdoctoral researcher over two years, all conducting research in the domain of unified speech modeling. The objectives of the project can be summarized as follows:
1. Multilingual ASR: investigating radically new approaches and techniques to rapidly build up the components required for multilingual speech processing. To this end, our proposed work program aims at (1) lowering the overall costs of data acquisition, (2) reducing the overall data needs, and (3) speeding up the development process for building speech processing components.
2. Multilingual TTS: further investigating new statistical TTS approaches (instead of approaches based on unit concatenation), with particular emphasis on speaker adaptation (voice conversion) and on cross-lingual adaptation and mapping (based on statistical rules or decision trees).
3. Bridging the gap between ASR and TTS: developing new common TTS and ASR models, e.g., based on templates (episodic models), as well as speaker and language adaptation techniques that are common to ASR and TTS; exploring new ways to exploit common ASR and TTS techniques to improve both fields, especially in the context of adaptive multilingual speech processing; and investigating new "universal" (multilingual) speech units that can be used in both TTS and ASR and that are more amenable to cross-lingual mapping.
As is standard practice at Idiap, the performance of all resulting systems will be evaluated on the standard benchmarks in the respective fields.
Idiap Research Institute
Swiss National Science Foundation
Oct 01, 2012
Sep 30, 2016