
Gaze estimation from RGB-D cameras

Human gaze is recognized as one of the most important non-verbal communication cues. Over the last 30 years there has been increasing interest in developing tools capable of automatically extracting this cue. Such tools would be valuable in many fields and applications, including psychology, sociology, marketing, robotics, and human-computer interfaces.

With the advent of consumer-grade RGB-D cameras, e.g. the Microsoft Kinect, many computer vision problems have been addressed in ways that were not possible before. The depth channel provides geometric information which is independent of illumination and texture.

Head pose and gaze direction are strongly linked, which makes head pose estimation a necessary step in gaze estimation. In this work we propose a multimodal approach which combines depth and visual information to address the gaze estimation problem. The depth information is used to accurately retrieve the head pose parameters, while the visual information is used to infer the state of the eyes from their appearance.


[1] Kenneth Funes, Jean-Marc Odobez. "Gaze estimation from multimodal Kinect data". In Proc. of CVPR Workshop on Gesture Recognition, Rhode Island, US, June 2012.




Method Overview


a) A personalized 3D face template is created offline by fitting a 3D Morphable Model (3DMM) to a set of face samples.

b) Given the personalized 3D face template, the head pose is tracked frame by frame using rigid ICP.

c) Head appearance stabilization: using the estimated head pose parameters, the head is rendered under the inverse rigid transformation, resulting in a frontal version of the face image, i.e. as if the camera were always frontal to the face.

d) The eye images are cropped and compared to a frontal gaze appearance model. The resulting gaze direction, which is expressed relative to the head, is then transformed using the head pose parameters.
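
The geometry behind steps c) and d) can be illustrated with a small sketch. It is not the implementation used in [1]. It assumes the head pose is a rotation R and a translation t that map head-frame points to camera-frame points (p_cam = R p_head + t), and every function name below is hypothetical. Under that assumption, stabilization applies the inverse rigid transformation to 3D points, and the gaze direction estimated in the frontal (head) frame is rotated back into the camera frame.

```python
import math

def mat_vec(m, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

def transpose(m):
    """Transpose a 3x3 matrix; for a rotation this is also its inverse."""
    return [[m[j][i] for j in range(3)] for i in range(3)]

def rotation_y(angle_rad):
    """Rotation about the vertical (y) axis, used here as a toy head yaw."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def to_head_frame(point_cam, R, t):
    """Inverse rigid transformation: p_head = R^T (p_cam - t)."""
    shifted = [point_cam[i] - t[i] for i in range(3)]
    return mat_vec(transpose(R), shifted)

def gaze_to_camera_frame(gaze_head, R):
    """Directions are rotated only; the translation does not apply."""
    return mat_vec(R, gaze_head)

# Toy head pose (assumed values): 30 degree yaw, head about 0.8 m from the camera.
R = rotation_y(math.radians(30.0))
t = [0.1, 0.0, 0.8]

# A point on the face template, observed in the camera frame.
eye_head = [0.03, 0.04, 0.0]
rotated = mat_vec(R, eye_head)
eye_cam = [rotated[i] + t[i] for i in range(3)]

# Stabilization recovers the head-frame (frontal) coordinates.
eye_frontal = to_head_frame(eye_cam, R, t)

# A gaze pointing straight out of the face (assumed to be -z in the head frame),
# re-expressed in the camera frame.
gaze_cam = gaze_to_camera_frame([0.0, 0.0, -1.0], R)
```

Because the gaze is a direction rather than a point, only the rotational part of the head pose is applied to it in the last step.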

Data collection

Here we propose a method for data collection with automatic ground truth extraction. Given a personalized 3D face template, we track the head position and therefore the position of the eyes. Simultaneously, we track a ball which is discriminative in both depth and color. Assuming the user follows the ball with their eyes, we label the gaze vector as the one pointing from the eye center to the position of the ball. This is shown in the following image:
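
As a minimal sketch of the labelling rule described above (not the authors' code), the ground-truth gaze can be computed as the unit vector from the tracked eye center to the tracked ball position. The sketch assumes both positions are 3D points in the same camera coordinate frame, in meters; the function name and the sample coordinates are hypothetical.

```python
import math

def gaze_label(eye_center, ball_position):
    """Unit vector pointing from the eye center to the ball."""
    direction = [b - e for e, b in zip(eye_center, ball_position)]
    norm = math.sqrt(sum(c * c for c in direction))
    if norm == 0.0:
        # Degenerate case: no direction can be defined.
        raise ValueError("eye center and ball position coincide")
    return [c / norm for c in direction]

# Assumed sample positions: eye 0.8 m from the camera, ball closer and to the side.
eye_center = [0.0, 0.0, 0.8]
ball_position = [0.3, 0.0, 0.4]
label = gaze_label(eye_center, ball_position)  # approximately [0.6, 0.0, -0.8]
```

Normalizing the vector keeps the label independent of how far the ball is from the user, so only the gaze direction is recorded.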