

On this page you will find pointers to the technical details of the automatic estimation of the 3D gaze direction from remote consumer RGB-D cameras, along with some additional resources.

- Goals and pipeline overview

- Gaze estimation from multimodal Kinect data

- Person Independent 3D Gaze Estimation From Remote RGB-D Cameras

- A Semi-Automated System for Accurate Gaze Coding in Natural Dyadic Interactions

- EYEDIAP: A Database for the Development and Evaluation of Gaze Estimation Algorithms from RGB and RGB-D Cameras

- Geometric Generative Gaze Estimation (G3E) for Remote RGB-D Cameras

- Who Will Get the Grant? A Multimodal Corpus for Analysis of Conversational Behaviours in Group Interviews

- Deciphering the Silent Participant: On the Use of Audio-Visual Cues for the Classification of Listener Categories in Group Discussions

- Gaze Estimation in the 3D Space Using RGB-D Sensors: Towards Head-Pose and User Invariance

Goal and setup

The main goal of this project is the estimation of the 3D gaze direction under head motion and with minimal user cooperation, i.e. without explicit strategies that facilitate the gaze estimation task, such as wearing head-mounted hardware, gazing at a set of calibration points prior to the estimation, or maintaining a particular head pose. For this reason we use remote (not head-mounted) sensors, and our research aims at tackling the important challenges that arise from these requirements.

Our general setup can be seen in Figure 1. The 3D gaze direction is defined as the vector pointing from the 3D position of the eyeball's fovea to the 3D position of the visual target, which may or may not be within the camera's field of view. As sensors we use consumer RGB-D cameras such as the Kinect™. Multimodal processing of the RGB-D data is beneficial: the depth information is highly valuable for estimating the head pose (required to estimate gaze), while the visual domain (RGB) contains the appearance information needed to retrieve the eyeball orientation.
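Concretely, the gaze direction defined above is just the unit vector between two 3D points expressed in the camera coordinate system. A minimal sketch (function and variable names are illustrative, not from the project's code):

```python
import numpy as np

def gaze_direction(eye_pos, target_pos):
    """Unit 3D gaze vector pointing from the eye position to the visual
    target, both expressed in the camera coordinate system (metres)."""
    v = np.asarray(target_pos, dtype=float) - np.asarray(eye_pos, dtype=float)
    return v / np.linalg.norm(v)

# Eye 40 cm in front of the camera, target 20 cm to its right:
g = gaze_direction([0.0, 0.0, 0.4], [0.2, 0.0, 0.4])
```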

General processing pipeline

Figure 2 shows the general processing pipeline we use, which can be summarized as follows:

a) A personalized 3D face template is created offline by fitting a 3D Morphable Model to a set of face samples.

b) Given the personalized 3D face template, the head pose is tracked frame by frame using rigid ICP.

c) Using the head pose parameters, the face appearance is rendered using the inverse rigid transformation, resulting in a frontal version of the face image.

d) The eye images are cropped and compared to a frontal gaze appearance model. The obtained gaze direction is then transformed back using the head pose parameters.
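Steps b) to d) largely reduce to rigid-transform bookkeeping. The following is a minimal numpy sketch under the assumption that the head pose is given as a rotation R and translation t mapping head coordinates to camera coordinates (names are illustrative, not the project's code):

```python
import numpy as np

def to_frontal(points_cam, R, t):
    """Step c): apply the inverse rigid transform so that points observed
    in the camera frame are expressed in the head (frontal) frame."""
    return (points_cam - t) @ R  # for 1-D arrays, v @ R equals R.T @ v

def gaze_to_camera(gaze_head, R):
    """Step d): map the eye-in-head gaze direction, estimated on the
    frontalized eye image, back into the camera coordinate system."""
    return R @ gaze_head

# Toy head pose: 30 degrees of yaw, head half a metre from the camera.
yaw = np.deg2rad(30.0)
R = np.array([[np.cos(yaw), 0.0, np.sin(yaw)],
              [0.0, 1.0, 0.0],
              [-np.sin(yaw), 0.0, np.cos(yaw)]])
t = np.array([0.0, 0.0, 0.5])

p_head = np.array([0.03, 0.02, 0.0])   # a vertex of the face template
p_cam = R @ p_head + t                 # where the camera observes it
```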

Figure 1) Setup


Figure 2) Processing pipeline


We build on top of this pipeline and address the many challenges of this task. The sections below report the specific contributions of this project.


Gaze estimation from multimodal Kinect data



This paper addresses the problem of free gaze estimation under unrestricted head motion. More precisely, unlike previous approaches that mainly focus on estimating gaze towards a small planar screen, we propose a method to estimate the gaze direction in 3D space. In this context the paper makes the following contributions: (i) leveraging the Kinect device, we propose a multimodal method that relies on depth sensing to obtain robust and accurate head pose tracking, even under large head rotations, and on the visual data to obtain the remaining eye-in-head gaze directional information from the eye image; (ii) a rectification scheme of the image that exploits the 3D mesh tracking, allowing us to conduct a head pose free eye-in-head gaze directional estimation; (iii) a simple way of collecting ground truth data thanks to the Kinect device. Results on three users demonstrate the great potential of our approach.

[Paper] [Code] [Demo]
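The depth-based head pose tracking is rigid ICP of the personalized face template against the frame's depth point cloud. A minimal, illustrative sketch (not the project's implementation; it assumes a reasonable initialization and uses a closed-form Kabsch alignment step):

```python
import numpy as np
from scipy.spatial import cKDTree

def kabsch(P, Q):
    """Closed-form least-squares rigid transform (R, t) mapping points P
    onto corresponding points Q (rows are 3D points)."""
    Pc, Qc = P - P.mean(axis=0), Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    return R, Q.mean(axis=0) - R @ P.mean(axis=0)

def rigid_icp(template, depth_points, iters=20):
    """Alternate nearest-neighbour correspondences and Kabsch alignment
    to track the head pose of the current frame (pipeline step b)."""
    R, t = np.eye(3), np.zeros(3)
    tree = cKDTree(depth_points)
    for _ in range(iters):
        _, idx = tree.query(template @ R.T + t)   # closest depth point per vertex
        R, t = kabsch(template, depth_points[idx])
    return R, t
```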

Person Independent 3D Gaze Estimation From Remote RGB-D Cameras

We address the problem of person independent 3D gaze estimation using a remote, low resolution, RGB-D camera. The approach relies on a sparse technique to reconstruct normalized eye test images from a gaze appearance model (a set of eye image/gaze pairs) and infer their gaze accordingly. In this context, the paper makes three contributions: (i) unlike most previous approaches, we exploit the coupling (and constraints) between both eyes to infer their gaze jointly; (ii) we show that a generic gaze appearance model built from the aggregation of person-specific models can be used to handle unseen users and compensate for appearance variations across people, since a test user's eye appearance will be reconstructed from similar users within the generic model; (iii) we propose an automatic model selection method that achieves comparable performance at a reduced computational load.

[Paper] [Code]
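The reconstruction idea behind the gaze appearance model can be illustrated as follows. This is a hypothetical sketch: the paper couples both eyes and uses a sparse solver, whereas here a single eye is used and scipy's non-negative least squares stands in for the sparse reconstruction:

```python
import numpy as np
from scipy.optimize import nnls

def estimate_gaze(eye_vec, D_app, D_gaze):
    """Reconstruct a (flattened, normalized) test eye image from the
    columns of the appearance dictionary D_app, then apply the same
    weights to the paired gaze directions in D_gaze."""
    w, _ = nnls(D_app, eye_vec)          # non-negative reconstruction weights
    w = w / max(w.sum(), 1e-12)          # normalize to a convex combination
    g = D_gaze @ w
    return g / np.linalg.norm(g)

# Toy dictionary: three "eye images" paired with three gaze directions.
D_app = np.eye(3)
D_gaze = np.array([[0.0, 0.0, 1.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]]).T   # column i = gaze of training sample i
```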

A Semi-Automated System for Accurate Gaze Coding in Natural Dyadic Interactions

In this paper we propose a system capable of accurately coding gazing events in natural dyadic interactions. Contrary to previous works, our approach exploits the actual continuous gaze direction of a participant by leveraging remote RGB-D sensors and a head pose-independent gaze estimation method. Our contributions are: i) we propose a system setup built from low-cost sensors and a technique to easily calibrate these sensors in a room with minimal assumptions; ii) we propose a method which, given short manual annotations, can automatically detect gazing events in the rest of the sequence; iii) we demonstrate on substantially long, natural dyadic data that high accuracy can be obtained, showing the potential of our system. Our approach is non-invasive and does not require collaboration from the interactors. These characteristics are highly valuable in psychology and sociology research.

[Paper] [Project]
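Given per-frame 3D gaze directions, a gazing event can be detected by testing whether the gaze ray stays within a tolerance cone around the eye-to-target direction. A minimal sketch (the threshold and names are illustrative, not the paper's annotation-driven procedure):

```python
import numpy as np

def gazing_at_target(gaze_dirs, eye_pos, target_pos, thresh_deg=10.0):
    """Per-frame boolean: does the gaze ray point at the target (e.g. the
    other interactor's face) within `thresh_deg` degrees?"""
    to_target = target_pos - eye_pos                       # (N, 3)
    to_target /= np.linalg.norm(to_target, axis=1, keepdims=True)
    cosang = np.clip(np.sum(gaze_dirs * to_target, axis=1), -1.0, 1.0)
    return np.degrees(np.arccos(cosang)) <= thresh_deg
```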

EYEDIAP: A Database for the Development and Evaluation of Gaze Estimation Algorithms from RGB and RGB-D Cameras


The lack of a common benchmark for the evaluation of the gaze estimation task from RGB and RGB-D data is a serious limitation for distinguishing the advantages and disadvantages of the many proposed algorithms found in the literature. This paper intends to overcome this limitation by introducing a novel database along with a common framework for the training and evaluation of gaze estimation approaches. In particular, we have designed this database to enable the evaluation of the robustness of algorithms with respect to the main challenges associated with this task: i) Head pose variations; ii) Person variation; iii) Changes in ambient and sensing conditions; and iv) Type of target: screen or 3D object.

[Paper] [Data] [Tech report] [Code]
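A common figure of merit when evaluating on such a benchmark is the angular error between estimated and ground-truth 3D gaze vectors; a minimal sketch of that metric (illustrative, not taken from the database's evaluation code):

```python
import numpy as np

def mean_angular_error_deg(g_est, g_true):
    """Mean angle (degrees) between rows of estimated and ground-truth
    3D gaze direction arrays."""
    g_est = g_est / np.linalg.norm(g_est, axis=-1, keepdims=True)
    g_true = g_true / np.linalg.norm(g_true, axis=-1, keepdims=True)
    cos = np.clip(np.sum(g_est * g_true, axis=-1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())
```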

Geometric Generative Gaze Estimation (G3E) for Remote RGB-D Cameras

We propose a head pose invariant gaze estimation model for distant RGB-D cameras. It relies on a geometric understanding of the 3D gaze action and generation of eye images. By introducing a semantic segmentation of the eye region within a generative process, the model (i) avoids the critical feature tracking of geometrical approaches, which requires high resolution images; (ii) decouples the person dependent geometry from the ambient conditions, allowing adaptation to different conditions without retraining. Priors in the generative framework are adequate for training from few samples. In addition, the model is capable of gaze extrapolation, allowing for less restrictive training schemes. Comparisons with state-of-the-art methods validate these properties, which make our method highly valuable for addressing many diverse tasks in sociology, HRI and HCI.

[Paper] [Demo]

Who Will Get the Grant? A Multimodal Corpus for Analysis of Conversational Behaviours in Group Interviews

In the last couple of years, more and more multimodal corpora have been created, and recently many of these have also included RGB-D sensor data. However, to our knowledge there is no publicly available corpus which combines accurate gaze tracking and high-quality audio recording for group discussions of varying dynamics. With a corpus that fulfilled these needs, it would be possible to investigate higher-level constructs such as group involvement, individual engagement or rapport, which all require multimodal feature extraction. In this paper we describe the design and recording of such a corpus and provide some illustrative examples of how it might be exploited in the study of group dynamics.

[Paper]

Deciphering the Silent Participant: On the Use of Audio-Visual Cues for the Classification of Listener Categories in Group Discussions


Estimating a silent participant's degree of engagement and role within a group discussion can be challenging, as no speech-related cues are available at the given time. Having this information, however, can provide important insights into the dynamics of the group as a whole. In this paper, we study the classification of listeners into several categories (attentive listener, side participant and bystander). We devised a thin-sliced perception test where subjects were asked to assess listener roles and engagement levels in 15-second video clips taken from a corpus of group interviews. Results show that humans are usually able to assess silent participant roles. Using the annotations together with a set of multimodal low-level features, such as past speaking activity, backchannels (both visual and verbal) and gaze patterns, we identified the features which distinguish between the different listener categories. Moreover, the results show that many of the audio-visual effects observed on listeners in dyadic interactions also hold for multi-party interactions. A preliminary classifier achieves an accuracy of 64%.

[Paper]

Gaze Estimation in the 3D Space Using RGB-D Sensors: Towards Head-Pose and User Invariance


We address the problem of 3D gaze estimation within a 3D environment from remote sensors, which is highly valuable for applications in human-human and human-robot interactions. Contrary to most previous works, which are limited to screen-gazing applications, we propose to leverage the depth data of RGB-D cameras to perform accurate head pose tracking, acquire head pose invariance through a 3D rectification process that renders head pose dependent eye images into a canonical viewpoint, and compute the line-of-sight in 3D space. To address the low resolution of the eye image resulting from the use of remote sensors, we rely on the appearance-based gaze estimation paradigm, which has demonstrated robustness against this factor. In this context, we conduct a comparative study of recent appearance-based strategies within our framework, study the generalization of these methods to unseen individuals, and propose a cross-user eye image alignment technique relying on the direct registration of gaze-synchronized eye images. We demonstrate the validity of our approach through extensive gaze estimation experiments on a public dataset as well as a gaze coding task applied to natural job interviews.

[Paper]
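The cross-user alignment step relies on direct registration of gaze-synchronized eye images. As an illustrative stand-in for the registration used in the paper, a brute-force search over small integer shifts maximizing zero-mean normalized correlation:

```python
import numpy as np

def best_shift(ref, img, max_shift=3):
    """Integer (dy, dx) shift of `img` that best matches `ref` under
    zero-mean normalized correlation (wrap-around via np.roll)."""
    a = ref - ref.mean()
    best, best_score = (0, 0), -np.inf
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            b = np.roll(img, (dy, dx), axis=(0, 1))
            b = b - b.mean()
            score = (a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
            if score > best_score:
                best, best_score = (dy, dx), score
    return best
```

In practice one would apply the found shift to bring a new user's eye crops into alignment with the appearance model before gaze inference.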