Speech & Audio Processing

Speech processing has been one of the mainstays of Idiap’s research portfolio for many years. Today it is still the largest group within the institute, and Idiap continues to be recognised as a leading proponent in the field. The expertise of the group encompasses statistical automatic speech recognition (based on hidden Markov models, or hybrid systems exploiting connectionist approaches), text-to-speech, and generic audio processing (covering sound source localization, microphone arrays, speaker diarization, audio indexing, very low bit-rate speech coding, and perceptual background noise analysis for telecommunication systems).

Current Group Members

Hervé Bourlard (EPFL Professor)
Afsaneh Asaei
Subhadeep Dey
Pranay Dighe
Marc Ferras
Phil Garner
Weipeng He
Pierre-Edouard Honnet
David Imseng
Alexandros Lazaridis
Srikanth Madikeri
Mathew Magimai Doss
Petr Motlicek
Hannah Muckenhirn
Pedro Henrique Oliveira Pinheiro
Dhananjay Ram
Marzieh Razavi
Ajay Srinivasamurthy
Sibo Tong
Yang Wang


Sara Bahaadini
David Barber
Samy Bengio
Volkan Cevher
Ricardo Chavarriaga
John Dines
Andrzej Drygajlo
Hynek Hermansky
Maria Ivanova
Iain McCowan
Francisco Pinto
Blaise Potard
José del R. Millán
György Szaszak
Raphael Ullmann
Pierre Wellner
Fabio Valente

Current Projects

ELEARNING-VALAIS_3.0 - eLearning-Valais 3.0
The eLearning-Valais 3.0 project aims to develop and implement innovative solutions that foster learning in education and increase employability.
ESGEM - Enhanced Swiss German mEdia Monitoring
The aim of ESGEM is to significantly enhance Swiss media monitoring by accommodating Swiss German dialect broadcasts and turning them into searchable text.
MALORCA - Machine Learning of Speech Recognition Models for Controller Assistance
One of the main causes hampering the introduction of higher levels of automation in the Air Traffic Management (ATM) world is the intensive use of spoken language as the natural way of communication.
OMSI-2015_ARMASUISSE - Objective Measurement of Speech Intelligibility
PHASER - Parsimonious Hierarchical Automatic Speech Recognition
This project aims to exploit and integrate, in a principled way, recent developments in posterior-based speech recognition systems, hybrid HMM/ANN systems, which combine hidden Markov models (HMMs) and artificial neural networks (ANNs), deep neural networks (a particular form of ANN with a deep, hierarchical, nonlinear architecture), compressive sensing, sparse modeling, and hierarchical sparse coding for ASR.
SIIP - EU Speaker Identification Integrated Project (SIIP)
SMILE - Scalable Multimodal sign language Technology for sIgn language Learning and assessmEnt
The goal of the proposed project SMILE is to pioneer an assessment system for Swiss German Sign Language (Deutschschweizerische Gebärdensprache, DSGS) using automatic sign language recognition technology.
SUMMA - Scalable Understanding of Multilingual Media
Media monitoring enables the global news media to be viewed in terms of emerging trends, people in the news, and the evolution of story-lines. The massive growth in the number of broadcast and Internet media channels means that current approaches can no longer cope with the scale of the problem.
UNITS - Unified Speech Processing Framework for Trustworthy Speaker Recognition
The goal of automatic speaker recognition is to recognize people by their voice. Automatic speaker verification is a subtask of speaker recognition in which the goal is to verify or authenticate a person. State-of-the-art speaker verification systems typically model short-term spectrum-based features, such as mel-frequency cepstral coefficients (MFCCs), with a generative model, such as a Gaussian mixture model (GMM), and employ a series of compensation methods to achieve low error rates. This approach has two main limitations. First, it requires sufficient training data for each speaker for robust modeling, and sufficient test data to apply the series of compensation techniques when verifying a speaker. Second, the speaker verification system is prone to malicious attacks, for example through voice conversion (VC) or text-to-speech (TTS) systems. The main reason is that the front-end features and back-end models of speaker verification systems, namely MFCCs and GMMs, are similar to those of VC and TTS systems.

Recent Projects

IM2 (Phase three) - Interactive Multimodal Information Management
IM2 is one of the 20 Swiss National Centres of Competence in Research (NCCRs), which aim to boost research and development in several areas considered of strategic importance to the Swiss economy. The NCCRs are a research instrument managed by the Swiss National Science Foundation on behalf of the Federal Authorities. Granted for a maximum duration of 12 years, they are evaluated every year by a review panel and renewed every four years. The success of the NCCRs is measured in terms of research achievements, training of young scientists (PhD students and postdocs), knowledge and technology transfer (including spin-offs), and advancement of women.
TA2 - Together Anywhere, Together Anytime
A-MUSE - Adaptive Multilingual Speech Processing
ADDG2SU - Flexible Acoustic Data-Driven Grapheme to Subword Unit Conversion
Current state-of-the-art automatic speech recognition (ASR) systems commonly use hidden Markov models (HMMs), where phonemes (phones) are assumed to be the intermediate subword units and each word to be recognized is explicitly modeled as a sequence of phonemes. Thus, despite availability of sophisticated statistical modeling or machine learning techniques, to develop an ASR system one requires prior knowledge, such as lexical resources (e.g., phoneme set, lexicon) and some minimum phonetic expertise.
ADDG2SU_EXT - Flexible Acoustic Data-Driven Grapheme to Subword Unit Conversion
AMI Consortium - Augmented Multiparty Interaction with Distance Access
AMIDA is a new project funded by the European Commission to continue the research begun under AMI.
DEEPSTD-EXT - Universal Spoken Term Detection with Deep Learning (extension)
The DeepSTD project focuses on applying deep learning methods to speech processing.
FlexASR - Flexible Grapheme-Based Automatic Speech Recognition
There has always been interest in directly using the grapheme (orthographic) transcription of a word, without explicit phonetic modeling. However, while this limits the variability at the word representation level, the link to the acoustic waveform becomes weaker (depending on the language), as standard acoustic features characterize phonemes. The most recent attempts were based on mapping the orthography of words onto HMM states using phonetic information, or on extending conventional HMM-based ASR systems by improving context-dependent modelling for grapheme units.
Geneemo - An Expressive Audio Content Generation Tool
InEvent - Accessing Dynamic Networked Multimedia Events
The main goal of inEvent is to develop new means to structure, retrieve, and share large archives of networked, and dynamically changing, multimedia recordings, mainly consisting here of meetings, video-conferences, and lectures.
MULTIVEO - High Accuracy Speaker-Independent Multilingual Automatic Speech Recognition System
OMSI_ARMASUISSE - Objective Measurement of Speech Intelligibility
PANDA - Perceptual Background Noise Analysis for the Newest Generation of Telecommunication Systems
ROCKIT - Roadmap for Conversational Interaction Technologies
SCOREL2 - Automatic scoring and adaptive pedagogy for oral language learning
SIWIS - Spoken Interaction with Interpretation in Switzerland
SODS - Semantically Self-Organized Distributed Web Search
In this project we wish to develop a new search engine distributed over available web servers, in contrast to existing search engines centralized at a single company site.
SP2 - SCOPES Project on Speech Prosody
This is a proposal for a Joint Research Project (JRP) under the SNSF SCOPES mechanism.
SCALE - Speech communication with adaptive learning

Speech Processing Group News

Idiap has a new opening for a postdoc position in automatic speaker recognition Jan 19, 2017
The Idiap Research Institute seeks a qualified candidate for a postdoctoral position in automatic speaker recognition.
Idiap has a new opening for a postdoc position in automatic spam call detection Jan 18, 2017
The Idiap Research Institute seeks a qualified candidate for a postdoctoral position in automatic spam call detection.
Idiap Submission to the NIST SRE 2016 Speaker Recognition Evaluation, October 2016 Nov 15, 2016
In October 2016, the National Institute of Standards and Technology (NIST), USA, organized the Speaker Recognition Evaluation (SRE), one of an ongoing series of speaker recognition system evaluations conducted by NIST since 1996.
Idiap has a new opening for a Post-doctoral position in automatic speech recognition Nov 08, 2016
The Idiap Research Institute invites applications for a post-doctoral position in automatic speech recognition. The position is funded by a new industrial project with a leading credit card company in Switzerland. The research and development project will focus on combining speech recognition and speaker verification technologies. The research will be carried out in collaboration with other projects (e.g. European H2020 projects) already running at the Idiap Research Institute.
The security of sensitive information: Prof. Hervé Bourlard interviewed for RTS’ radio show “CQFD” May 11, 2016
How is it possible for people with bad intentions to get access to data from our smartphone or our GPS?


