A data set for human-robot interaction scenarios available for research purposes

Description | Statistics | Evaluation | Samples

Get Data


In the frame of the MuMMER project, we collected the MuMMER data set a data set for human-robot interaction scenarios which is available for research purposes. It comprises 1 h 29 min of multimodal recordings of people interacting with the social robot Pepper in entertainment scenarios, such as quiz, chat, and route guidance. In the 33 clips (of 1 to 4 min long) recorded from the robot point of view, the participants are interacting with the robot in an unconstrained manner.

The data set exhibits interesting features and difficulties, such as people leaving the field of view, robot moving (head rotation with embedded camera in the head), different illumination conditions. The data set contains color and depth videos from a Kinect v2, an Intel D435, and the video from Pepper 1.7.

All the visual faces and the identities in the data set were manually annotated, making the identities consistent across time and clips. The goal of the data set is to evaluate perception algorithms in multi-party human/robot interaction, in particular the reidentification part when a track is lost, as this ability is crucial for keeping the dialog history. The data set can easily be extended with other types of annotations.


The scene View from the Kinect View from the Intel D435 View from the robot


To properly benchmark tracking and perception algorithms, we have annotated all the faces and identities that appear in the three color video streams. All identities are consistent between frames, sequences, and sensors.


Figure 1: Example of annotations in the 3 views: Kinect, Intel, Pepper

Table 1: Main figures of the data set
Feature Value
Number of participants 28
Number of protagonists 22
Number of clips 33
Shortest clip 1 min 6 s
Longest clip 5 min 6 s
Total duration 1 h 29 min
Maximum number of persons in one frame 9
Kinect color frames 80,488
Kinect depth frames 80,865
Intel color frames 80,346
Intel depth frames 80,310
Robot color frames 47,023
Robot depth frames 23,450
Number of annotated faces 506,713


The output of a tracker system can be evaluated in the same way as for the MOT challenge, using MATLAB code or Python code.

We provide a baseline system (code available for research soon).


These videos shows 2 examples of color videos in which the images of the 3 sensors were concatenated (from left to right: a Kinect v2, Intel D435, Pepper 1.7.). Faces are blurred and audio is removed for privacy issues (distributed data contains raw data, not blurred, and with audio).

An easy sequence (only 2 protagonists, no passerby)

A hard sequence (3 protagonists, multiple passerby)