The TA2 database consists of roughly 1 hour 20 minutes of audio and video recordings made in two separate rooms connected to each other via a video-conferencing system.
Each room contains 2 to 4 participants at the same time, sitting around a table, chatting, and playing online games. The scenario is largely unconstrained, i.e. the participants can move and talk freely. Two different games were played across the rooms: Battleships and Pictionary.
The following snapshots illustrate the recorded video data of the two rooms. (Faces have been blurred manually here.)
The video has been recorded from a central camera facing the participants. One room was recorded at a higher resolution of 1920x1080 pixels, the other at 720x576 pixels, both at 25 fps.
The audio has been recorded from a circular microphone array placed on the table in the middle of each room, with 4 microphones in one room and 8 in the other, at 48 kHz, 16-bit.
Audio and video have been manually synchronised within each room. Note that the video-conferencing hardware (cameras and microphones) is separate from the recording hardware.
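The 48 kHz, 16-bit multichannel WAV format described above can be read with Python's standard `wave` module. The sketch below is illustrative only: the file name and channel layout are assumptions (the actual file naming of the database is not specified here), so the example first writes a tiny synthetic 8-channel file and then reads one channel back from it.

```python
import wave
import struct

SAMPLE_RATE = 48000  # as in the TA2 recordings
N_CHANNELS = 8       # 8-microphone array; the other room uses 4
N_FRAMES = 480       # 10 ms of audio for the demo

# Write a tiny synthetic 16-bit multichannel file so the sketch is
# self-contained ("array_demo.wav" is a made-up name, not a TA2 file).
with wave.open("array_demo.wav", "wb") as w:
    w.setnchannels(N_CHANNELS)
    w.setsampwidth(2)  # 2 bytes = 16-bit samples
    w.setframerate(SAMPLE_RATE)
    # Interleaved frames: channel 0 carries a ramp, the rest are silent.
    frames = b"".join(
        struct.pack("<%dh" % N_CHANNELS, i % 1000, *([0] * (N_CHANNELS - 1)))
        for i in range(N_FRAMES)
    )
    w.writeframes(frames)

# Read it back and de-interleave channel 0.
with wave.open("array_demo.wav", "rb") as w:
    assert w.getframerate() == SAMPLE_RATE
    assert w.getsampwidth() == 2  # 16-bit
    data = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (N_CHANNELS * N_FRAMES), data)
    channel0 = samples[::N_CHANNELS]

print(len(channel0), channel0[1])
```

Real array recordings may instead be distributed as one mono file per microphone; the de-interleaving step would then be unnecessary.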
Four types of annotation are available (in XML format) for all of the recorded data:
- head positions in the video
- voice activity (speech/non-speech)
- spoken word transcription (+ laughter and other noise)
- Direction Of Arrival (DOA) of sound
All events are annotated consistently with the corresponding person IDs across the four annotation types.
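Since the annotations are distributed as XML, they can be loaded with any standard XML parser. The element and attribute names below (`segment`, `person`, `start`, `end`, `type`) are purely illustrative assumptions, not the actual TA2 schema; the sketch parses an inline snippet to show the general pattern of extracting person-labelled, time-stamped events.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation snippet; the real TA2 XML schema may differ.
xml_text = """
<annotations>
  <segment person="P1" start="12.40" end="15.10" type="speech"/>
  <segment person="P2" start="15.30" end="16.05" type="laughter"/>
</annotations>
"""

root = ET.fromstring(xml_text)
segments = [
    (s.get("person"), float(s.get("start")), float(s.get("end")), s.get("type"))
    for s in root.iter("segment")
]
for person, start, end, kind in segments:
    print(f"{person}: {kind} from {start:.2f}s to {end:.2f}s")
```

For real files, `ET.parse(path).getroot()` would replace `ET.fromstring`, and the per-event person IDs allow linking, say, a voice-activity segment to the matching head position in the video.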
When using this data for your research, please cite the following paper in your publications:
S. Duffner, P. Motlicek, D. Korchagin, "The TA2 Database: A Multi-Modal Database from Home Entertainment," in Proceedings of the International Conference on Signal Acquisition and Processing (ICSAP), Singapore, February 2011.