BSS-MSR Demo

This page provides the setup for the subjective evaluation of our proposed blind source separation technique based on model-based sparse recovery (BSS-MSR), published in:

Model-based Compressive Sensing for Multi-party Distant Speech Recognition, Afsaneh Asaei, Hervé Bourlard and Volkan Cevher, ICASSP 2011.

Abstract

We leverage the recent algorithmic advances in compressive sensing, and propose a novel source separation algorithm for efficient recovery of convolutive speech mixtures in spectro-temporal domain. Compared to the common sparse component analysis techniques, our approach fully exploits structured sparsity models to obtain substantial improvement over the existing state-of-the-art. We evaluate our method for separation and recognition of a target speaker in a multi-party scenario. Our results provide compelling evidence of the effectiveness of sparse recovery formulations in speech recognition.

 

Background

Compressive sensing (CS) is sensing via dimensionality reduction. Because dimensionality reduction arises naturally in many problems, we can leverage CS theory and algorithms to solve them. In theory, CS relies on three ingredients: (1) a sparse representation, (2) incoherent measurements and (3) a signal recovery algorithm. In the following, we briefly explain how each of these ingredients is realized in the BSS-MSR framework.
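For reference, the three ingredients can be written in standard CS notation. The symbols below (y, \Phi, x, M, N, K) are our own shorthand and are not taken from this page: x is the sparse signal, \Phi the measurement matrix and y the low-dimensional observation. The recovery problem is shown in its generic K-sparse form, not in the block-structured form used by BSS-MSR.

```latex
\begin{align}
  y &= \Phi x, \qquad \Phi \in \mathbb{C}^{M \times N}, \quad M \ll N, \quad \|x\|_0 \le K \ll N \\
  \hat{x} &= \arg\min_{x} \; \|y - \Phi x\|_2^2 \quad \text{subject to} \quad \|x\|_0 \le K
\end{align}
```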

 

BSS from compressive measurements

Key idea: We cast the under-determined speech separation problem as a sparse signal recovery problem and solve it using compressive sensing theory:

(1) Spatio-spectral sparse representation: we discretize the room into a dense grid of cells, only a few of which contain speech activity. We consider the time-frequency (t-f) representation of the speech signal located at each cell. We exploit spatial sparsity in tandem with spectral sparsity to obtain a sparse representation of the signal in which the sparse coefficients exhibit a block inter-dependency structure.
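To make the block structure concrete, the following minimal sketch builds a toy block-sparse vector. The layout is an assumption made for illustration only: one block of adjacent t-f coefficients per grid cell, 64 cells, 4 coefficients per block and three active cells. The helper name build_block_sparse is hypothetical.

```python
import numpy as np

def build_block_sparse(n_cells, block_len, active_cells, rng):
    """Return a complex vector made of n_cells blocks of length block_len.

    Only the blocks listed in active_cells carry nonzero coefficients,
    mimicking a room in which few grid cells contain a speaker.
    """
    x = np.zeros((n_cells, block_len), dtype=complex)
    for g in active_cells:
        x[g] = rng.standard_normal(block_len) + 1j * rng.standard_normal(block_len)
    return x.reshape(-1)

rng = np.random.default_rng(0)
n_cells, block_len = 64, 4            # illustrative sizes, not from the paper
x = build_block_sparse(n_cells, block_len, active_cells=[5, 40, 61], rng=rng)

# Energy per block: nonzero only for the active cells.
block_energy = np.sum(np.abs(x.reshape(n_cells, block_len)) ** 2, axis=1)
active = np.flatnonzero(block_energy > 0)
print(active)
```

Sparsity is counted here in blocks, not in individual coefficients: the vector has 12 nonzero entries but only 3 active blocks out of 64.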

(2) Incoherent measurement: we model the room as a rectangular enclosure consisting of finite-impedance walls. The point source-to-microphone impulse responses are calculated using the Image Method. Taking the physics of signal propagation into account, we construct the measurement matrix associated with the microphone array from the projections given by the Green's function of the medium.
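A simplified sketch of how such a measurement matrix could be assembled at a single frequency is given below. It is not the implementation used in the paper: it keeps only the direct path and the six first-order image sources, uses the free-space Green's function exp(-jkd)/(4*pi*d), and assumes a single frequency-independent reflection coefficient beta. The room size, microphone positions, grid and function names are all hypothetical.

```python
import numpy as np

C = 343.0  # assumed speed of sound in m/s

def first_order_images(src, room):
    """Direct source plus its six first-order mirror images in a shoebox room.

    Returns a list of (position, reflection_order) pairs.
    """
    images = [(np.array(src, dtype=float), 0)]
    for axis in range(3):
        for wall in (0.0, room[axis]):
            img = np.array(src, dtype=float)
            img[axis] = 2.0 * wall - img[axis]   # mirror across the wall plane
            images.append((img, 1))
    return images

def measurement_matrix(mics, cells, room, freq, beta=0.7):
    """Matrix with one row per microphone and one column per grid cell."""
    k = 2.0 * np.pi * freq / C                    # wavenumber
    phi = np.zeros((len(mics), len(cells)), dtype=complex)
    for g, cell in enumerate(cells):
        for img, order in first_order_images(cell, room):
            d = np.linalg.norm(mics - img, axis=1)   # image-to-mic distances
            phi[:, g] += beta ** order * np.exp(-1j * k * d) / (4.0 * np.pi * d)
    return phi

room = (5.0, 4.0, 3.0)                                        # metres
mics = np.array([[2.4, 2.0, 1.5], [2.6, 2.0, 1.5]])           # stereo pair
xs, ys = np.linspace(0.5, 4.5, 5), np.linspace(0.5, 3.5, 4)
cells = np.array([[x, y, 1.5] for x in xs for y in ys])       # 20 grid cells
phi = measurement_matrix(mics, cells, room, freq=1000.0)
print(phi.shape)
```

With 2 microphones and 20 candidate cells the system is heavily under-determined, which is why the sparsity structure is needed to make recovery possible.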

(3) Model-based sparse recovery: we incorporate the block structure underlying the sparse coefficients in an efficient model-based sparse recovery algorithm inspired by the algebra used in Nesterov’s optimal gradient and optimization techniques, hence the name Algebraic Pursuit (ALPS). In our signal recovery step, a block-sparse signal is approximated by reweighting and thresholding the energy of the blocks, along with a gradient calculation, at each iteration.
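As a rough illustration of combining a gradient step with thresholding of block energies, the following is a minimal sketch of plain block iterative hard thresholding. It is not ALPS: it omits the Nesterov-style acceleration and the reweighting mentioned above. The function names, the random Gaussian matrix standing in for the acoustic measurement matrix, and the problem sizes are all assumptions chosen for the example.

```python
import numpy as np

def block_hard_threshold(x, block_len, k_blocks):
    """Keep the k_blocks blocks with the largest energy, zero the rest."""
    blocks = x.reshape(-1, block_len)
    energy = np.sum(np.abs(blocks) ** 2, axis=1)
    keep = np.argsort(energy)[-k_blocks:]
    out = np.zeros_like(blocks)
    out[keep] = blocks[keep]
    return out.reshape(-1)

def block_iht(phi, y, block_len, k_blocks, n_iter=200):
    """Gradient step on 0.5*||y - phi x||^2 followed by block thresholding."""
    step = 1.0 / np.linalg.norm(phi, 2) ** 2      # 1 / (largest singular value)^2
    x = np.zeros(phi.shape[1], dtype=np.result_type(phi, y))
    for _ in range(n_iter):
        grad = phi.conj().T @ (phi @ x - y)
        x = block_hard_threshold(x - step * grad, block_len, k_blocks)
    return x

rng = np.random.default_rng(0)
n_blocks, block_len, k_blocks, m = 32, 4, 2, 64   # illustrative sizes
phi = rng.standard_normal((m, n_blocks * block_len)) / np.sqrt(m)

x_true = np.zeros(n_blocks * block_len)
for b in (3, 17):                                  # two active blocks
    x_true[b * block_len:(b + 1) * block_len] = rng.standard_normal(block_len)
y = phi @ x_true

x_hat = block_iht(phi, y, block_len, k_blocks)
residual = np.linalg.norm(y - phi @ x_hat)
print(residual, np.linalg.norm(x_hat - x_true))
```

The conservative step size keeps the residual from increasing between iterations, at the cost of slow convergence; accelerated schemes such as the one referenced above aim to reduce the number of iterations needed.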

 

Experiments

The following set-up is simulated for stereo recording of convolutive multi-party speech using the Image Method with a room reverberation time of 200 ms. The target speech signals are clean AURORA 2 utterances, while the interfering signals are randomly taken from HTIMIT.

 

This is an illustration of the speech signals as waveforms.

You can also listen to a few sample utterances

Stereo Demixing of 3 sources Sample1 Sample2 Sample3 Sample4 Sample5
Reference Close Microphone
Reference Distant Microphone
BSS-MSR Demixing Techniques
Stereo Demixing of 5 sources Sample1 Sample2 Sample3 Sample4 Sample5
Reference Close Microphone
Reference Distant Microphone
BSS-MSR Demixing Techniques

 

All experiments are conducted within the framework of the AURORA 2 speech database. The speech recognition word accuracy of the separated target speech for echoic mixtures of 5 sources, as well as the relative improvement obtained, is summarized in the following table and bar diagram. BSS-MSR1 refers to the stereo recording and BSS-MSR2 refers to the 4-channel microphone recording.

 

 
