AV16.3

Audio-Visual Corpus for Speaker Localization and Tracking

Two examples: Snapshots, videos, audio and 3D annotation

The red ellipses indicate the locations of the two microphone arrays. The colored balls on the heads were used in the 3-D annotation process.

[Snapshots of seq45 from camera #1, camera #2 and camera #3]

High-level description

The AV16.3 corpus is an audio-visual corpus of 43 real indoor multi-speaker recordings, designed to test algorithms for audio-only, video-only and audio-visual speaker localization and tracking. Real human speakers were used, and the variety of recordings was chosen to push algorithms to their limits and to cover a wide range of application scenarios (meetings, surveillance). The emphasis is on overlapped speech and multiple moving speakers: most recordings are dynamic, with single or multiple moving speakers, and a few meeting scenarios with mostly seated speakers are also included.

Uses

Using the AV16.3 corpus,

  • Javier Macias-Guarasa made some pretty nice audio tracking demos,
  • T.W. Pirinen (paper) and Jacek P. Dmochowski (journal paper and paper) published on audio localization,
  • Nam Truong Pham published on video multi-camera multi-speaker tracking,
  • Hari K. Maganti used the mouth annotation tool for his own corpus (thesis),
  • Javed Ahmed evaluated his correlation-based visual tracking on seq45-3p-1111, one of the most complex sequences,
  • I used AV16.3 extensively, e.g. in my thesis and a journal paper.

If you would like to be cited here, please contact us at (data-manager at idiap.ch).

Technical details

Recordings were made with two 8-microphone Uniform Circular Arrays (16 kHz sampling frequency) and three digital cameras (25 frames per second) placed around the meeting room, hence the name "AV16.3" (16 microphone channels, 3 cameras). Whenever possible, each speaker also wore a lapel microphone. All sensors were synchronized. The three cameras were calibrated and used to determine the ground-truth 3-D location of each speaker's mouth, with a maximum error of 1.2 cm. To the best of our knowledge, this was the first such annotated audio-visual corpus to be made publicly available (recorded in fall 2003, published in June 2004 at the MLMI'04 workshop).
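
As a concrete illustration of these fixed rates, the following Python sketch relates the number of audio samples in one array channel to the number of synchronized video frames. It assumes the channel is available as a standard 16 kHz WAV file; the file name is a hypothetical example, not an actual corpus path.

    import wave

    AUDIO_RATE = 16000   # Hz, as stated above
    VIDEO_RATE = 25      # frames per second, as stated above

    # Hypothetical file name, for illustration only.
    with wave.open("seq45_array1_mic01.wav", "rb") as w:
        assert w.getframerate() == AUDIO_RATE, "unexpected sampling rate"
        n_samples = w.getnframes()

    # Since all sensors were synchronized, each video frame spans a fixed
    # number of audio samples: 16000 / 25 = 640.
    samples_per_frame = AUDIO_RATE // VIDEO_RATE
    n_video_frames = n_samples // samples_per_frame
    print(f"{n_samples} audio samples correspond to about {n_video_frames} video frames")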

How to use the corpus

  • The only requirement is to cite the following paper:
    "AV16.3: an Audio-Visual Corpus for Speaker Localization and Tracking",
    by Guillaume Lathoud, Jean-Marc Odobez and Daniel Gatica-Perez,
    in Proceedings of the MLMI'04 Workshop, 2004.
  • If you need to extract PPM images from an AVI file, you can use mplayer, as in the following example:
    mplayer -ss 00:00:00 -vo pnm -vf scale=360:288 seq03-1p-0000_cam1_divx_audio.avi
  • Compatibility issues: there are a few binary data files ("*.mat", created with MATLAB 6.5.1).
    If your MATLAB version cannot read them, use the MATLAB scripts that recreate their content
    (a "*_mat.m" ASCII file in the same directory as each "*.mat" file), or see the Python sketch
    after this list.
  • For other technical help, contact (data-manager at idiap.ch).
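
If no compatible MATLAB version is available at all, the MATLAB 6.5-era "*.mat" files are in MAT-file version 5 format, which SciPy can usually read. A minimal Python sketch (the file name is a hypothetical example, not an actual corpus path):

    from scipy.io import loadmat

    # Hypothetical file name; MATLAB 6.5 saves MAT-file version 5, which loadmat supports.
    data = loadmat("seq45-3p-1111_annotation.mat")
    print(data.keys())   # inspect which variables the file contains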

Acknowledgement

If you use this dataset, please cite the following publication:

"AV16.3: an Audio-Visual Corpus for Speaker Localization and Tracking",
by Guillaume Lathoud, Jean-Marc Odobez and Daniel Gatica-Perez,
in Proceedings of the MLMI'04 Workshop, 2004.