Idiap combines its multi-disciplinary expertise to advance the
understanding of human perceptual and cognitive systems, engaging in
research on multiple aspects of human-computer interaction with
computational artefacts such as natural language understanding and
translation, document and text processing, vision and scene analysis,
multimodal interaction, computational cognitive systems, and methods
for automatically training such systems (see our research efforts in
Speech processing
The overall goals of the Idiap speech processing group are to research and develop robust automatic speech recognition and understanding systems for realistic speaking styles and acoustic conditions. This includes advanced research activities, maintenance of language resources for the training and testing of recognition systems, and development of (multi-lingual) large vocabulary continuous speech recognition systems, as well as real-time prototypes. The group has been involved in speech research projects for several years and is today at the leading edge of technology.
The current speech processing research activities include: automatic recognition of speech based on phonetic (sub-word) modeling, using spectral-temporal profiles of speech as well as articulatory features; development and improvement of state-of-the-art speech recognition systems based on hidden Markov models (HMM) and on combinations of HMMs and artificial neural networks (HMM/ANN); multi-stream and multi-band processing and combination; acoustic change detection and clustering, including speaker diarization ("who spoke when?"); pronunciation variant modeling; statistical language modeling; speaker adaptation; speaker source localization, microphone arrays and beamforming; and development of new acoustic features (e.g., posterior-based features).
Contact: Hervé Bourlard
Natural language understanding and translation
Language understanding by computers can be thought of as the automatic segmentation of texts or dialogues into units of various granularities, the automatic labeling of these units with tags from a given set, and possibly the identification of relations between units. The levels of analysis pertain to words, phrases, sentences, or whole texts or dialogues, and language understanding often proceeds towards an increasing level of abstraction of the linguistic content. Current research directions at Idiap focus on word sense disambiguation and topic identification in spoken dialogues, and on the analysis of relations between sentences and their markers, with potential applications to machine translation of texts.
Contact: Andrei Popescu-Belis
Document and text processing
Text analysis techniques aim at effectively accessing the information contained in large document repositories. Such repositories are still the most common way of storing the knowledge relevant to a wide spectrum of human activities such as business, education and information. Mostly based on statistical analysis of word occurrences and co-occurrences, text analysis serves many applications, e.g., Information Retrieval (identification of texts relevant to an information need), Text Categorization (attribution of documents to one or more predefined categories), and Summarization (extraction of the most informative segments from documents). The same techniques can be applied to collections of multimedia data as well, as these often contain texts that can be extracted automatically, e.g., speech recordings and handwritten texts that can be automatically transcribed, or videos showing captions and other written information.
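As a rough illustration of retrieval based on word-occurrence statistics, the sketch below ranks a few toy documents against a query using TF-IDF weights and cosine similarity. It is a minimal example under stated assumptions, not a description of any Idiap system: the function names (`tokenize`, `tfidf_vectors`, `cosine`, `rank`), the smoothed IDF formula, and the sample documents are all hypothetical choices made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one {word: weight} dict per document, plus the IDF table."""
    counts = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter(word for c in counts for word in c)
    # Smoothed IDF (an assumption of this sketch; other variants exist).
    idf = {w: math.log((1 + n_docs) / (1 + df)) + 1.0
           for w, df in doc_freq.items()}
    vectors = [{w: tf * idf[w] for w, tf in c.items()} for c in counts]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, docs):
    """Return document indices ordered from most to least similar."""
    vectors, idf = tfidf_vectors(docs)
    q_counts = Counter(tokenize(query))
    # Query words unseen in the collection carry no weight.
    q_vec = {w: tf * idf[w] for w, tf in q_counts.items() if w in idf}
    scores = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    return [i for _, i in sorted(scores, key=lambda s: (-s[0], s[1]))]


# Hypothetical toy collection, for illustration only.
docs = [
    "speech recognition with hidden Markov models",
    "tracking objects in video sequences",
    "statistical language models for speech transcripts",
]
print(rank("speech recognition", docs))
```

The same ranking function could in principle be applied to automatically transcribed speech or extracted captions, since it only needs text as input.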
Contact: Alessandro Vinciarelli
Vision and scene analysis
Computer vision is a research domain aiming at the construction of artificial systems capable of extracting semantic information from visual data, where the visual data ranges from a single image to a collection of images or an array of videos captured by a network of sensors. Typical computer vision tasks addressed at Idiap include the detection, localization and tracking of objects, the estimation of their pose, the automatic annotation and indexing of images, and the recognition of events for the interpretation of 3D scene content. More specifically, current research directions focus on part-based or pose-parameterized representations for object detection, the development of multi-cue fusion and adaptation algorithms for multi-object tracking, face modeling, the unsupervised identification of activities from large data collections, and the design of human behavior analysis tools for the development of surveillance applications or context-aware systems in intelligent spaces.
Contact: Jean-Marc Odobez, François Fleuret
Multimodal interaction
Many real-world applications, such as the prediction of the dominant speaker in a meeting, the identification of individuals, or robot navigation, require the combination of several modalities. Indeed, performance can be improved tremendously by combining visual information with audio processing and alternative sensors.
Many challenges, both theoretical and practical, arise in such a context because different modalities, each with its own time scale, dynamic range and dimensionality, must be integrated in a single consistent framework. Idiap has acquired extensive expertise in the design of joint statistical models of these modalities through the introduction of latent variables and assumptions of conditional independence.
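As a generic illustration of what such a joint model can look like (not a specific Idiap model), assume an audio observation a, a visual observation v, and a latent variable z such as the identity of the active speaker. These three symbols are assumptions introduced only for this example. If the two modalities are taken to be conditionally independent given z, the joint distribution factorizes as:

```latex
p(a, v) = \sum_{z} p(z)\, p(a \mid z)\, p(v \mid z)
```

Under this assumption each modality can be modeled separately through p(a | z) and p(v | z), at its own time scale and dimensionality, while the latent variable z ties the two together.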
Contact: François Fleuret
Computational cognitive science
Computational cognitive science is a research domain devoted to the development of artificial autonomous systems able to perform tasks requiring cognitive abilities such as recognition, learning, understanding and reasoning. Research at Idiap in this area addresses several crucial issues: vision-based robot localization, knowledge transfer across modalities and concepts, categorization, and online learning. Specifically, current research directions focus on how to build semantic spatial representations that are meaningful for humans and useful for an autonomous agent, the design of online learning algorithms with bounded memory growth and fast training time, how to exploit prior knowledge of visual object hierarchies to quickly learn a new subcategory from a few training images via knowledge transfer, and how to learn correspondences between multimodal representations of concepts.
Contact: Barbara Caputo