Robust face tracking, feature extraction and multimodal fusion for audio-visual speech recognition

Human communication is a combination of speech and non-verbal behavior. A significant part of the non-verbal information is conveyed by face movements and expressions. A major step in the automatic analysis of human communication is therefore the location and tracking of human faces. In this project, we will first tackle the problem of robust face tracking, that is, the continuous estimation of head pose and facial animation in video sequences. Building on this first development, two subsequent workpackages will address important building blocks towards the automatic analysis of natural scenes, namely automatic audio-visual speech recognition and Visual Focus of Attention (VFOA) analysis. Both strongly rely on robust face tracking and will therefore directly exploit and benefit from the results of the first workpackage.

Our research in face tracking will rely on 3D deformable models learned from training data, which have proven effective at modeling individual face shapes and expressions and at handling self-occlusions. We will address recurrent issues in the domain (strong illumination variations, tracking near profile views, automatic initialization and reinitialization) by investigating three main points: memory-based appearance learning, which aims at building face-state-dependent mixture appearance models from past tracking observations; a multi-feature face representation, combining stable semantic structural points located around facial attributes (eyes, mouth), opportunistic sparse texture and interest points distributed throughout the face, in particular on regions with less predictable appearance (head sides), and dynamic features (head profiles); and a hybrid fitting scheme combining discriminative approaches for fast feature localization, matching for distant 3D (rigid) registration, and iterative approaches for precise model estimation.
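The abstract does not specify how the multi-feature cues would be combined during fitting; as an illustrative sketch only, the iterative model-estimation step could minimize a cost that stacks the reprojection residuals of each cue (semantic points, texture/interest points, contour features) under per-cue reliability weights and a robust penalty. All function names, the weighting scheme, and the Huber threshold below are assumptions, not the project's actual method.

```python
import numpy as np

def fitting_cost(residuals_by_cue, weights, delta=2.0):
    """Hypothetical weighted, robust fitting cost for 3D model registration.

    residuals_by_cue: dict mapping a cue name to an (N_i, 2) array of 2D
                      reprojection errors (pixels) for that cue's features.
    weights:          dict mapping the same cue names to scalar reliability
                      weights (assumed known here; in practice they could be
                      estimated from past tracking observations).
    delta:            Huber threshold (pixels) limiting outlier influence.
    """
    cost = 0.0
    for cue, res in residuals_by_cue.items():
        r = np.linalg.norm(res, axis=1)  # per-feature error magnitude
        # Huber penalty: quadratic near zero, linear for large errors
        huber = np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta))
        cost += weights[cue] * huber.sum()
    return float(cost)
```

Down-weighting less predictable cues (e.g. head-side interest points) relative to stable semantic points is one plausible way such a scheme could trade robustness against precision.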

Human speech perception is bimodal in nature: we unconsciously combine audio and visual information to decide what has been spoken. In the second workpackage we therefore consider both the audio and visual dimensions of the problem and develop techniques exploiting the two modalities and their interaction. Building on our previous work, in this project we focus on visual feature extraction and audio-visual integration in realistic situations. The work is divided into three main tasks: exploitation of single-view video sequences, exploitation of multi-view sequences, and application to a real-world task. We will first address the problems of non-ideal lighting conditions and of image sequences where people suddenly move, turn their heads, or occlude their mouths. Our work will address the extraction of optimal visual features, the estimation of their reliability, and their dynamic combination with the audio stream for speech recognition. The second task exploits multi-view sequences to extract more robust and reliable visual features. Finally, the developed techniques will be applied to audio-visual speech recognition in cars under different real-world conditions.
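The abstract mentions dynamically combining visual features with the audio stream according to their estimated reliability, without giving a formula. A common scheme in the audio-visual speech recognition literature, shown here purely as an assumed sketch of what such a combination could look like, weights the per-class log-likelihoods of the two streams with an exponent that reflects audio reliability (e.g. derived from an SNR estimate):

```python
import numpy as np

def fused_log_likelihood(ll_audio, ll_video, audio_reliability):
    """Exponent-weighted combination of per-class log-likelihoods.

    audio_reliability is assumed to be a scalar in [0, 1]: 1.0 trusts the
    audio stream fully, 0.0 falls back entirely on the visual stream.
    """
    lam = float(np.clip(audio_reliability, 0.0, 1.0))
    return lam * np.asarray(ll_audio) + (1.0 - lam) * np.asarray(ll_video)

def decide(ll_audio, ll_video, audio_reliability):
    """Pick the class with the highest fused score."""
    return int(np.argmax(fused_log_likelihood(ll_audio, ll_video,
                                              audio_reliability)))
```

Under this rule, as audio conditions degrade (lower reliability weight), the decision shifts smoothly toward the class favored by the visual stream.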

The third workpackage concerns gaze, which is recognized as one of the most important aspects of non-verbal communication and social interaction, with functions such as establishing relationships through mutual gaze, regulating the course of interaction, and expressing intimacy or social control. Again exploiting the results of the first workpackage, we will develop probabilistic models mapping visual cues such as head pose and orientation to gazing directions. Two main research threads will be explored. The first will rely on computer vision techniques to obtain gaze measurements from the eye region in addition to the head pose measurements, and on inferring their contribution to the estimation of the gazing direction. The second thread will investigate the development of gaze models involving the coordination between head and gaze orientations, exploiting empirical findings from behavioral studies of alert monkeys and humans that describe the contribution of head and eye movements to gaze shifts. Building on our previous work, the gaze system will be exploited to identify different gazing gestures and human attitudes in dynamic human-human communication settings, such as establishing a relation through eye contact or averting gaze.
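The head-gaze coordination models mentioned above are not specified in the abstract. As a purely illustrative sketch, behavioral studies are often summarized by a piecewise-linear rule: small gaze shifts are made with the eyes alone, while for larger shifts the head covers a growing fraction of the movement. The threshold and slope values below, and both function names, are assumptions for illustration, not empirical constants from the project.

```python
import math

def gaze_direction(head_pan, eye_in_head_pan):
    """World-frame gaze pan (degrees): head pan plus the eyes' rotation
    within the head. A minimal decomposition assumed for illustration."""
    return head_pan + eye_in_head_pan

def predict_head_contribution(gaze_shift, threshold=20.0, slope=0.65):
    """Assumed piecewise-linear head-gaze coordination rule (degrees):
    shifts below `threshold` are executed with the eyes alone; beyond it,
    the head contributes a fraction `slope` of the remaining amplitude."""
    mag = abs(gaze_shift)
    if mag <= threshold:
        return 0.0
    return math.copysign(slope * (mag - threshold), gaze_shift)
```

In an inference setting, such a forward model could serve as the likelihood term of a probabilistic gaze estimator: given an observed head rotation, it constrains the plausible range of underlying gaze shifts.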

In summary, this project addresses three fundamental technical components of automatic human-to-human communication analysis. It will make an important technical contribution both to the emerging field of social signal processing, which aims at developing computational models for machine understanding of communicative and social behavior, and to human computing, which seeks to design human-centered interfaces capable of seamless interaction with people.

Application Area - Human Machine Interaction, Perceptive and Cognitive Systems
Idiap Research Institute
Swiss National Science Foundation
Jan 01, 2011
Mar 31, 2014