UBIPose

A dataset intended for the evaluation of head pose estimation algorithms in natural and challenging scenarios.

Description of the Corpus

The UBIPose dataset relies on videos from the UBImpressed dataset, which was captured to study the performance of students from the hospitality industry at their workplace. The role play happens at a reception desk, where students interact with a research assistant who plays the role of a customer. Students and customers are recorded using a Kinect 2 sensor (one per person). In this free and natural setting, large head poses and sudden head motions are frequent, people are observed from a relatively large distance, and they are mainly seen from the side. Idiap Research Institute shares this dataset to enable the evaluation of head pose estimation algorithms in free and challenging scenarios.

Out of the 160 interactions recorded in the UBImpressed dataset, we selected 32 videos. These videos are divided as follows:

22 videos (with 22 different persons) are provided as evaluation data. For 10 of these videos, a 30-50 second clip was cut from the original video and all of its frames were annotated. The other 12 videos were annotated over their full length at one frame per second. This made it possible to cover a large diversity of situations. In total, this amounts to 14.4K frames. The labels we provide are the positions of 6 facial landmarks (the two corners of each eye, the nasal root, and the nose tip) and 3D head poses (roll, pitch, yaw).
10 additional videos can be used for processing and illustrating algorithmic results. These videos are unannotated and intended for the visualization of methods in scientific dissemination activities.
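
As an illustration of the labels described above, the following Python sketch shows one way a per-frame annotation could be held in memory. It is not the dataset's file format: the landmark names, their ordering, the field names, and the choice of degrees for the angles are all assumptions made for this example, and the landmark coordinates are kept as generic tuples because the dimensionality is not specified here.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Hypothetical landmark names (assumption): two corners per eye,
# the nasal root, and the nose tip, i.e. 6 landmarks in total.
LANDMARK_NAMES = (
    "right_eye_outer", "right_eye_inner",
    "left_eye_inner", "left_eye_outer",
    "nasal_root", "nose_tip",
)


@dataclass(frozen=True)
class FrameLabel:
    """One annotated frame: 6 landmark positions and a 3D head pose."""
    frame_index: int
    landmarks: Dict[str, Tuple[float, ...]]  # coordinates per landmark name
    roll: float   # angle unit (degrees) is an assumption
    pitch: float
    yaw: float

    def __post_init__(self) -> None:
        # Reject labels that do not carry all 6 expected landmarks.
        missing = set(LANDMARK_NAMES) - set(self.landmarks)
        if missing:
            raise ValueError(f"missing landmarks: {sorted(missing)}")
```

A parser for the actual ground truth files would fill such a structure after reading the format documented with the dataset.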

Dataset content

The dataset contains the original video files to be processed (depth and RGB), the ground truth files (including those used for reconstruction and exploited for landmark localization evaluations), and code to evaluate performance. More precisely, the contents are as follows:

the RGB videos of the 22 test videos from 22 different users used in the paper for performance evaluation;
the synchronized depth videos of these 22 test videos;
the audio frame indices of these 22 test videos;
the annotated landmarks for 14.4K frames;
the validated inferred head poses for 10.4K frames;
the full output results of our method;
code to compute the performance reported in the paper, as well as the performance of newly produced pose results;
videos for display: 10 additional pairs of RGB and synchronized depth videos can be used for processing and illustrating the algorithm results. These videos are unannotated and only intended for the visualization of methods in public dissemination activities.
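
The evaluation code shipped with the dataset is the reference for reproducing the reported numbers. Purely to illustrate what comparing produced pose results with the ground truth can look like, the Python sketch below computes a mean absolute error per angle over the frames present in both sets. The function names, the dictionary layout (frame index mapped to roll, pitch, yaw), and the use of degrees are assumptions for this example, and the metric may differ from the one used in the paper.

```python
from typing import Dict, Tuple

Pose = Tuple[float, float, float]  # (roll, pitch, yaw), assumed in degrees


def angle_diff_deg(a: float, b: float) -> float:
    """Smallest absolute difference between two angles, in degrees."""
    d = (a - b) % 360.0
    return min(d, 360.0 - d)


def mean_absolute_errors(gt: Dict[int, Pose],
                         pred: Dict[int, Pose]) -> Dict[str, float]:
    """Mean absolute error per angle over frames found in both inputs."""
    common = sorted(set(gt) & set(pred))
    if not common:
        raise ValueError("no frames in common between ground truth and results")
    sums = [0.0, 0.0, 0.0]
    for idx in common:
        for k in range(3):
            sums[k] += angle_diff_deg(gt[idx][k], pred[idx][k])
    n = len(common)
    return {
        "roll": sums[0] / n,
        "pitch": sums[1] / n,
        "yaw": sums[2] / n,
        "frames": n,  # number of frames actually compared
    }
```

Frames without a validated ground truth pose are skipped, since only frame indices present in both dictionaries are compared. The wrap-around in `angle_diff_deg` matters for poses close to ±180 degrees, where a naive subtraction would overstate the error.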

References

@inproceedings{Muralidhar:2016:TJB:2993148.2993191,
	author = {Muralidhar, Skanda and Nguyen, Laurent Son and Frauendorfer, Denise and Odobez, Jean-Marc and Schmid Mast, Marianne and Gatica-Perez, Daniel},
	title = {Training on the Job: Behavioral Analysis of Job Interviews in Hospitality},
	booktitle = {Proceedings of the 18th ACM International Conference on Multimodal Interaction},
	series = {ICMI 2016},
	year = {2016},
	location = {Tokyo, Japan},
	pages = {84--91},
	numpages = {8},
	publisher = {ACM},
	address = {New York, NY, USA}
}

@article{Yu:PAMI:2018,
	author = {Yu, Yu and Funes Mora, Kenneth Alberto and Odobez, Jean-Marc},
	title = {HeadFusion: 360 Head Pose tracking combining 3D Morphable Model and 3D Reconstruction},
	journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)},
	year = {2018}
}