Sound Source Localization for Robots (SSLR)

Pepper Sound Localization Dataset



The Sound Source Localization for Robots (SSLR) Dataset is a collection of real robot audio recordings for the development and evaluation of sound source localization methods. Its features are:

  • We recorded the audio with SoftBank's Pepper robot [1].
  • The data include robot ego noise and overlapping speech sources.
  • The dataset is separated into three sections: loudspeaker recordings, human talker recordings and noise recordings.
  • It can serve as a standard benchmark dataset for research on learning-based sound localization techniques.


If you use the data for your research or publication, please cite the following paper (bib file, fulltext):

He, Weipeng, Petr Motlicek, and Jean-Marc Odobez. “Deep Neural Networks for Multiple Speaker Detection and Localization.” In 2018 IEEE International Conference on Robotics and Automation (ICRA), 2018.

Microphone Array and Reference Frame

We used the microphones on Pepper [2] for the recordings. There are four microphones on top of the robot's head, all of which are directional (cardioid, look direction: front). The microphone array geometry is specified in the file misc/array_geo, where each row gives the 3D coordinates of one microphone.

Both microphone and source locations in the dataset are expressed in the microphone array frame. Its origin is the center of the microphone array, and the unit is meters. The X, Y and Z axes point to the front, left and up, respectively.
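
Localization methods often work with directions rather than 3D points. The following is a minimal sketch of converting a location in the microphone array frame to azimuth and elevation. The function name and the angle conventions (azimuth measured from the front and positive toward the left, elevation positive upward) are assumptions made for this example and are not defined by the dataset.

```python
import math

def to_azimuth_elevation(x, y, z):
    """Convert a location in the microphone array frame to angles in degrees.

    Uses the frame described above: X = front, Y = left, Z = up.
    Assumed conventions (not part of the dataset): azimuth is measured from
    the X axis and is positive toward the left; elevation is measured from
    the horizontal plane and is positive upward.
    """
    azimuth = math.degrees(math.atan2(y, x))
    elevation = math.degrees(math.atan2(z, math.hypot(x, y)))
    return azimuth, elevation

# A source one meter directly to the left of the array center.
print(to_azimuth_elevation(0.0, 1.0, 0.0))
```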

Loudspeaker Recordings

The loudspeaker recordings comprise 25 hours of re-recorded speech from the AMI corpus [3]. Clean speech from the AMI corpus was played through loudspeakers and recorded by Pepper. Source locations were fixed during each segment, and there were up to two simultaneous sources.

The following directories include the data:

├── lsp_train_106
├── lsp_train_301
├── lsp_test_106
├── lsp_test_library

The directory names indicate whether the data are used for training or testing, and the room in which they were recorded. The rooms are:

106: a large conference room
301: a small conference room
library: a small room with shelves

Each of these directories contains three sub-directories:

├── lsp_<*>
│   ├── audio
│   ├── gt_file
│   └── gt_frame

Audio Files

The audio directory includes the audio wav files, named "RECORD_ID.wav".

File-Level Ground Truth

The gt_file directory includes the file-level ground truth labels, with the file names "". Each ground truth file is a Python tuple stored with pickle. The tuple consists of:

  • the recording ID
  • the start time of the recording in the original ROS bag file
  • the end time of the recording in the original ROS bag file
  • a list of source labels, where each source label is a tuple of:
    • 3D source location
    • source audio file (a segment from the AMI corpus)
    • the start time of the source in the recording
    • the end time of the source in the recording
    • relative volume of the source
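
The tuple layout above can be unpacked as in the following sketch. The function names and dictionary keys are hypothetical, chosen only for illustration, and the path passed to load_file_gt is whatever ground truth file you want to read. If loading fails under Python 3, it may be worth trying pickle.load(f, encoding="latin1"), a common workaround for pickles written by Python 2; whether that applies to these files is an assumption.

```python
import pickle

def unpack_file_gt(gt):
    """Turn one file-level ground truth tuple into a dictionary.

    The field order follows the description above; the dictionary keys
    are hypothetical names chosen for this sketch.
    """
    record_id, bag_start, bag_end, source_labels = gt
    sources = []
    for location, source_file, start, end, volume in source_labels:
        sources.append({
            "location": tuple(location),   # 3D source location (meters)
            "source_file": source_file,    # segment from the AMI corpus
            "start": start,                # start time in the recording
            "end": end,                    # end time in the recording
            "volume": volume,              # relative volume
        })
    return {
        "record_id": record_id,
        "bag_start": bag_start,            # start time in the ROS bag file
        "bag_end": bag_end,                # end time in the ROS bag file
        "sources": sources,
    }

def load_file_gt(path):
    """Load and unpack a single file-level ground truth file."""
    with open(path, "rb") as f:
        return unpack_file_gt(pickle.load(f))
```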

Frame-Level Ground Truth

The gt_frame directory includes the frame-level ground truth labels, with the file names "RECORD_ID.wWIN_SIZEoHOP_SIZE.gtf.pkl". WIN_SIZE is the number of samples in a frame, and HOP_SIZE is the number of samples between the starts of consecutive frames. We only provide the frame-level ground truth with WIN_SIZE = 8192 and HOP_SIZE = 4096. You can generate ground truth with other specifications using the provided tools.

The files are also stored with Python pickle. Each file consists of a list of tuples, and each tuple consists of:

  • the frame ID, which starts from 0. The frame with ID t contains the samples in [t * HOP_SIZE, t * HOP_SIZE + WIN_SIZE).
  • a list of active sources (an empty list if no source is active). Each active source contains:
    • 3D source location
    • source type, which is always 1 (speech source).
    • speaker ID
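
The naming scheme and the frame-to-sample mapping described above can be sketched as follows. The helper names are hypothetical; the defaults are the provided WIN_SIZE = 8192 and HOP_SIZE = 4096, and the frames argument is assumed to be the list obtained by unpickling one frame-level ground truth file.

```python
def gt_frame_filename(record_id, win_size=8192, hop_size=4096):
    """Build the frame-level ground truth file name for a recording."""
    return "%s.w%do%d.gtf.pkl" % (record_id, win_size, hop_size)

def frame_sample_range(frame_id, win_size=8192, hop_size=4096):
    """Return the half-open sample range [start, end) covered by a frame."""
    start = frame_id * hop_size
    return start, start + win_size

def count_active_sources(frames):
    """Map each frame ID to its number of active sources.

    `frames` is a list of (frame_id, active_sources) tuples, where each
    active source is a (location, source_type, speaker_id) tuple.
    """
    return {frame_id: len(sources) for frame_id, sources in frames}
```

With the default sizes, consecutive frames overlap by half a window: frame 0 covers samples 0 to 8191 and frame 1 covers samples 4096 to 12287.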

Human Talker Recordings

The human talker recordings include around 4 minutes of recordings of human subjects speaking to Pepper in controlled HRI scenarios. The data are in:

├── human
│   ├── audio
│   ├── audio_gt
│   ├── gt_frame
│   └── video_gt

The audio and gt_frame directories have the same format as in the loudspeaker recordings.

Audio Ground Truth

The audio_gt directory includes the voice activity ground truth. The files, named "RECORD_ID.txt", were manually labeled and exported from Audacity [4].

Each row stores the start timestamp, end timestamp, and speaker of a voice segment. The timestamps are relative to the original ROS bag file, not the wav file in the audio directory. The segments with speaker ID "start" and "end" indicate the timestamps of the first and last samples in the wav file, respectively.
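
Because the timestamps refer to the ROS bag file, they need to be shifted before they can be compared with positions in the wav file. The sketch below parses the rows and subtracts the timestamp of the "start" marker. It assumes whitespace-separated fields and that the first field of the "start" row is the time of the first wav sample; the function name is hypothetical.

```python
def parse_voice_segments(lines):
    """Parse label rows and return segments in seconds relative to the wav file.

    Each row is expected to hold: start timestamp, end timestamp, speaker.
    Rows labeled "start" and "end" are markers, not voice segments.
    """
    wav_start = None
    segments = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        begin, end, speaker = float(fields[0]), float(fields[1]), fields[2]
        if speaker == "start":
            wav_start = begin  # assumed: time of the first wav sample
        elif speaker != "end":
            segments.append((begin, end, speaker))
    if wav_start is None:
        raise ValueError("no 'start' marker found")
    return [(b - wav_start, e - wav_start, s) for b, e, s in segments]
```

A typical call would pass an open file object, for example parse_voice_segments(open(path)), since iterating over a text file yields its rows.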

Video Data and Ground Truth

The video_gt directory includes the video data and source location ground truth. Each recording has a sub-directory named RECORD_ID, containing the following files:

r%06d.png: the original images recorded by Pepper's front camera. The number indicates the camera frame ID.
p%06d.png: the images overlaid with the nose positions of the tracked persons. We used the convolutional pose machine (CPM) [5] for the detection of faces and a color-based tracker [6] for the tracking. The nose positions are used as the ground truth source locations of talkers.
g%06d.pkl: the ground truth positions, a map from tracked person ID to the 3D location of the sound source.
stamps: the timestamp of each frame.
id2name: the mapping from the tracked person ID to speaker ID (used in the audio annotations).

Although we included the camera images of all recordings, only some of the subjects agreed to allow their recorded data to be used as illustrations in scientific publications or displayed during public presentations. The recordings that may be displayed are: s2_15, s2_16, s2_18, s3_23, s3_24, s3_25, s3_26, s3_29, s3_30.
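
As a small illustration of the file naming and the display restriction described above, the following sketch builds the per-frame file paths for a recording and checks whether its images may be shown. The function names, dictionary keys and the example directory are hypothetical; the list of recording IDs is copied from the paragraph above.

```python
import os

# Recordings whose images may be shown in publications or presentations.
DISPLAYABLE = {
    "s2_15", "s2_16", "s2_18",
    "s3_23", "s3_24", "s3_25", "s3_26", "s3_29", "s3_30",
}

def frame_files(video_gt_dir, record_id, frame_id):
    """Return the paths of the per-frame files for one camera frame."""
    base = os.path.join(video_gt_dir, record_id)
    return {
        "raw": os.path.join(base, "r%06d.png" % frame_id),      # original image
        "overlay": os.path.join(base, "p%06d.png" % frame_id),  # image with nose positions
        "gt": os.path.join(base, "g%06d.pkl" % frame_id),       # person ID -> 3D location
    }

def may_display(record_id):
    """Tell whether the images of a recording may be displayed publicly."""
    return record_id in DISPLAYABLE

print(frame_files(os.path.join("human", "video_gt"), "s2_15", 7)["gt"])
print(may_display("s2_15"))
```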

Noise Recordings

The noise recordings are recordings of Pepper's fan noise in a quiet office. The files are in the noise directory.


We included scripts to generate frame-level ground truth with different frame specifications. The scripts are:


You can check the usage by running the scripts with option '-h'.



[5] Cao, Zhe, Tomas Simon, Shih-En Wei, and Yaser Sheikh. “Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields.” ArXiv:1611.08050 [Cs], November 23, 2016.
[6] Khalidov, Vasil, and Jean-Marc Odobez. “Real-Time Multiple Head Tracking Using Texture and Colour Cues.” Idiap, 2017.