
Speech & Audio Processing

Speech processing has been one of the mainstays of Idiap’s research portfolio for many years. The group remains the largest within the institute, and Idiap continues to be recognised as a leading proponent in the field. The expertise of the group encompasses statistical automatic speech recognition (based on hidden Markov models, or hybrid systems exploiting connectionist approaches), text-to-speech, and generic audio processing (covering sound source localization, microphone arrays, speaker diarization, audio indexing, very low bit-rate speech coding, and perceptual background noise analysis for telecommunication systems).
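As a concrete example of the kind of processing involved in sound source localization with microphone arrays, below is a minimal sketch of GCC-PHAT time-delay estimation, one standard building block of such systems. It is illustrative only, not the group's implementation; the signals, sampling rate, and function name are invented for the sketch.

    import numpy as np

    def gcc_phat(x, y, fs, max_tau=None):
        """Estimate the delay of x relative to y (in seconds) via GCC-PHAT."""
        n = len(x) + len(y)                 # zero-pad to avoid circular wrap-around
        X = np.fft.rfft(x, n=n)
        Y = np.fft.rfft(y, n=n)
        R = X * np.conj(Y)
        R /= np.abs(R) + 1e-12              # PHAT weighting: keep phase only
        cc = np.fft.irfft(R, n=n)
        max_shift = n // 2
        if max_tau is not None:
            max_shift = min(int(fs * max_tau), max_shift)
        # Re-centre so the middle of the array corresponds to zero delay.
        cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
        return (np.argmax(np.abs(cc)) - max_shift) / fs

    fs = 16000
    sig = np.random.default_rng(0).normal(size=fs)
    delayed = np.concatenate((np.zeros(80), sig))[:fs]   # 80 samples = 5 ms lag
    print(gcc_phat(delayed, sig, fs))                    # ~0.005

With delays estimated between several microphone pairs, the source direction can then be triangulated from the array geometry.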

Current Group Members

Afsaneh Asaei
Sara Bahaadini
Hervé Bourlard (EPFL Professor)
Marc Ferras
Phil Garner
Pierre-Edouard Honnet
David Imseng
Maria Ivanova
Alexandros Lazaridis
Srikanth Madikeri
Mathew Magimai Doss
Petr Motlicek
Francisco Pinto
Blaise Potard
Marzieh Razavi
György Szaszak
Raphael Ullmann


Former Group Members

David Barber
Samy Bengio
Volkan Cevher
Ricardo Chavarriaga
John Dines
Andrzej Drygajlo
Hynek Hermansky
Iain McCowan
José del R. Millán
Fabio Valente
Pierre Wellner

Current Projects

SUMMA - Scalable Understanding of Multilingual Media
Media monitoring enables the global news media to be viewed in terms of emerging trends, people in the news, and the evolution of story-lines. The massive growth in the number of broadcast and Internet media channels means that current approaches can no longer cope with the scale of the problem.
SMILE - Scalable Multimodal sign language Technology for sIgn language Learning and assessmEnt
The goal of the proposed project SMILE is to pioneer an assessment system for Swiss German Sign Language (Deutschschweizerische Gebärdensprache, DSGS) using automatic sign language recognition technology.
MALORCA - Machine Learning of Speech Recognition Models for Controller Assistance
One of the main causes hampering the introduction of higher levels of automation in the Air Traffic Management (ATM) world is the intensive use of spoken language as the natural way of communication.
UNITS - Unified Speech Processing Framework for Trustworthy Speaker Recognition
The goal of automatic speaker recognition is to recognize persons through their voice. Automatic speaker verification is a subtask of speaker recognition in which the goal is to verify or authenticate a claimed identity. State-of-the-art speaker verification systems typically model short-term spectrum-based features, such as mel-frequency cepstral coefficients (MFCCs), with a generative model such as a Gaussian mixture model (GMM), and employ a series of compensation methods to achieve low error rates. This approach has two main limitations. First, it requires sufficient training data for each speaker for robust modeling, and sufficient test data for the compensation techniques to be applied when verifying a speaker. Second, the system is prone to malicious attacks, for example through voice conversion (VC) or text-to-speech (TTS) systems, mainly because its front-end features and back-end models, namely MFCCs and GMMs, are similar to those used by VC and TTS systems.
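As a rough, hypothetical illustration of the MFCC/GMM pipeline described above, the sketch below fits a speaker model and a background model with scikit-learn and scores a verification trial with an average per-frame log-likelihood ratio. The random matrices stand in for real MFCC features, and the zero threshold is arbitrary; a deployed GMM-UBM system would MAP-adapt the speaker model from the background model and apply the compensation steps mentioned above.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    ubm_feats = rng.normal(0.0, 1.0, size=(2000, 13))   # background ("world") data
    spk_feats = rng.normal(0.5, 1.0, size=(500, 13))    # enrollment data, claimed speaker
    test_feats = rng.normal(0.5, 1.0, size=(200, 13))   # features of the trial utterance

    ubm = GaussianMixture(n_components=8, covariance_type='diag',
                          random_state=0).fit(ubm_feats)
    spk = GaussianMixture(n_components=8, covariance_type='diag',
                          random_state=0).fit(spk_feats)

    # Average per-frame log-likelihood ratio: accept the claim if above a tuned threshold.
    llr = spk.score(test_feats) - ubm.score(test_feats)
    print('accept' if llr > 0.0 else 'reject', llr)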
SIIP - EU Speaker Identification Integrated Project
OMSI-2015_ARMASUISSE - Objective Measurement of Speech Intelligibility
PHASER - Parsimonious Hierarchical Automatic Speech Recognition
The present project aims at exploiting and integrating, in a principled way, recent developments in posterior-based speech recognition: hybrid HMM/ANN systems combining Hidden Markov Models (HMMs) and Artificial Neural Networks (ANNs), Deep Neural Networks (a particular form of ANN with a deep, hierarchical, nonlinear architecture), compressive sensing, sparse modeling, and hierarchical sparse coding for ASR.
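The hybrid HMM/ANN idea can be stated in a few lines: the ANN estimates per-frame phone posteriors p(q|x), which are divided by the class priors p(q) to obtain scaled likelihoods p(x|q)/p(x) that replace the HMM emission probabilities during decoding. The sketch below is purely illustrative, with random numbers standing in for real network outputs.

    import numpy as np

    n_frames, n_phones = 100, 40
    rng = np.random.default_rng(0)

    # Stand-in for ANN outputs: a row-stochastic matrix of phone posteriors.
    logits = rng.normal(size=(n_frames, n_phones))
    posteriors = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

    # Class priors, normally estimated from the frame labels of the training set.
    priors = np.full(n_phones, 1.0 / n_phones)

    # Scaled log-likelihoods log p(x|q) - log p(x): the emission scores a
    # standard Viterbi decoder would consume in place of GMM likelihoods.
    scaled_loglik = np.log(posteriors + 1e-10) - np.log(priors)
    print(scaled_loglik.shape)   # (100, 40): one score per frame and phone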
SP2 - SCOPES Project on Speech Prosody
This is a proposal for a Joint Research Project (JRP) under the SNSF SCOPES mechanism.
SODS - Semantically Self-Organized Distributed Web Search
SCOREL2 - Automatic scoring and adaptive pedagogy for oral language learning
AddG2SU - Flexible Acoustic Data-Driven Grapheme to Subword Unit Conversion
SIWIS - Spoken Interaction with Interpretation in Switzerland
ROCKIT - Roadmap for Conversational Interaction Technologies
Geneemo - An Expressive Audio Content Generation Tool
A-MUSE - Adaptive Multilingual Speech Processing
DEEPSTD-EXT - Universal Spoken Term Detection with Deep Learning (extension)
The DeepSTD project applies deep learning methods to speech processing.

Recent Projects

OMSI_ARMASUISSE - Objective Measurement of Speech Intelligibility
AMI Consortium - Augmented Multi-party Interaction with Distant Access
AMIDA is a European Commission project funded to continue the research begun under AMI.
TA2 - Together Anywhere, Together Anytime
IM2 (Phase three) - Interactive Multimodal Information Management
IM2 is one of the 20 Swiss National Centres of Competence in Research (NCCR), which aim at boosting research and development in areas considered of strategic importance to the Swiss economy. The NCCRs are a research instrument managed by the Swiss National Science Foundation on behalf of the Federal Authorities. Granted for a maximum duration of 12 years, they are evaluated every year by a review panel and renewed every four years. Success of the NCCRs is measured in terms of research achievements, training of young scientists (PhD students and postdocs), knowledge and technology transfer (including spin-offs), and advancement of women.
SCALE - Speech communication with adaptive learning
FlexASR - Flexible Grapheme-Based Automatic Speech Recognition
There has always been interest in directly using the grapheme (orthographic) transcription of a word, without explicit phonetic modeling. However, while this limits variability at the word-representation level, the link between the word representation and the acoustic waveform becomes weaker (to a degree that depends on the language), since standard acoustic features characterize phonemes. Most recent attempts were based on mapping the orthography of words onto HMM states using phonetic information, or on extending conventional HMM-based ASR systems by improving context-dependent modelling for grapheme units.
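A minimal sketch of the grapheme-based lexicon idea: the "pronunciation" of a word is simply its letter sequence, optionally tagged with left and right neighbours in the style of context-dependent modelling. Both helper functions below are hypothetical simplifications, not the project's actual units.

    def grapheme_lexicon(word):
        """Map a word to its grapheme units; no phonetic dictionary needed."""
        return list(word.lower())

    def context_dependent(units, pad='#'):
        """Tag each grapheme with its neighbours, triphone-style."""
        padded = [pad] + units + [pad]
        return ['{}-{}+{}'.format(padded[i - 1], padded[i], padded[i + 1])
                for i in range(1, len(padded) - 1)]

    units = grapheme_lexicon('speech')
    print(units)                     # ['s', 'p', 'e', 'e', 'c', 'h']
    print(context_dependent(units))  # ['#-s+p', 's-p+e', 'p-e+e', ...]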
InEvent - Accessing Dynamic Networked Multimedia Events
The main goal of inEvent is to develop new means to structure, retrieve, and share large archives of networked and dynamically changing multimedia recordings, consisting mainly of meetings, video-conferences, and lectures.
PANDA - Perceptual Background Noise Analysis for the Newest Generation of Telecommunication Systems
MULTIVEO - High Accuracy Speaker-Independent Multilingual Automatic Speech Recognition System

