Audio-Visual Probabilistic Tracking of Multiple Speakers in Meetings

Daniel Gatica-Perez, Guillaume Lathoud, Jean-Marc Odobez, and Iain McCowan
Companion videos


Abstract

Tracking speakers in multiparty conversations is a fundamental task in automatic meeting analysis. In this paper, we present a novel probabilistic approach to jointly track the location and speaking activity of multiple speakers in a multisensor meeting room equipped with a small microphone array and multiple uncalibrated cameras. Our framework is based on a mixed-state dynamic graphical model defined on a multiperson state space, which includes an explicit proximity-based interaction model. Approximate inference in our model, necessary given its complexity, is performed with a Markov chain Monte Carlo particle filter (MCMC-PF), which yields high sampling efficiency. Our framework integrates audio-visual (AV) data through a novel observation model: audio observations are derived from a source localization algorithm, while visual observations are based on models of the shape and spatial structure of human heads. We present results (based on an objective evaluation procedure) showing that our framework (1) locates and tracks the position and speaking activity of multiple meeting participants engaged in real conversations with good accuracy; (2) can deal with cases of visual clutter and occlusion; and (3) significantly outperforms a traditional sampling-based approach.
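As a rough illustration of the inference step mentioned in the abstract, the sketch below shows a generic Metropolis-Hastings particle update on a scalar toy state. The function names (`mcmc_pf_step`, `likelihood`, `propose`) and the single-target 1-D setting are illustrative assumptions; the paper's actual model operates on a multiperson state space with AV observations and an interaction prior.

```python
import math
import random

def mcmc_pf_step(particles, likelihood, propose, n_iters=500):
    """One MCMC particle-filter update (illustrative sketch, not the
    paper's exact model).

    Starting from the previous particle set, run a Metropolis-Hastings
    chain whose stationary distribution is the unnormalized posterior
    given by `likelihood`; the states visited by the chain form the new
    particle set.
    """
    # Initialize the chain from a randomly chosen previous particle.
    state = random.choice(particles)
    w = likelihood(state)
    new_particles = []
    for _ in range(n_iters):
        cand = propose(state)        # symmetric proposal assumed
        w_cand = likelihood(cand)
        # Metropolis acceptance test (symmetric proposal, so the
        # acceptance ratio is just the likelihood ratio).
        if w_cand >= w or random.random() < w_cand / max(w, 1e-300):
            state, w = cand, w_cand
        new_particles.append(state)
    return new_particles

# Toy usage: the chain should concentrate near the likelihood peak.
random.seed(0)
prev = [0.0] * 10                                   # previous particle set
lik = lambda x: math.exp(-(x - 2.0) ** 2)           # peak at x = 2
prop = lambda x: x + random.gauss(0.0, 0.5)         # random-walk proposal
new = mcmc_pf_step(prev, lik, prop, n_iters=2000)
```

The key efficiency argument made in the paper is that such a chain explores a joint multiperson state space far more economically than drawing and weighting independent joint samples, as a conventional joint particle filter does.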



Results (videos in AVI format)


Meeting 1. Four people engaged in a conversation, 69 sec. "Speaking" is represented by a double ellipse.

  MCMC-PF, 500 particles
  Joint PF, 500 particles


Meeting 2. Four people talking, 48 sec. Effects of visual clutter

  MCMC-PF, 500 particles
  Joint PF, 500 particles


Meeting 2. Five people, handling occlusion (case oc1)

  MCMC-PF, interaction model, 500 particles
  MCMC-PF, no interaction model, 500 particles


Meeting 2. Five people, handling occlusion (case oc2)

  MCMC-PF, interaction model, 500 particles

Meeting 3. Four people talking, 48 sec, one person walks into the room

  MCMC-PF, 500 particles
  Joint PF, 500 particles


Meeting 1. Auto-initialization, full sequence

  MCMC-PF, 500 particles


Meeting 2. Auto-initialization, five-people segment

  MCMC-PF, 500 particles