This page presents the setup for the subjective evaluation of our proposed blind source separation technique via model-based sparse recovery (BSS-MSR), published in:

Model-based Compressive Sensing for Multi-party Distant Speech Recognition, Afsaneh Asaei, Hervé Bourlard and Volkan Cevher, ICASSP 2011.

**Abstract**

We leverage the recent algorithmic advances in compressive sensing, and propose a novel source separation algorithm for efficient recovery of convolutive speech mixtures in spectro-temporal domain. Compared to the common sparse component analysis techniques, our approach fully exploits structured sparsity models to obtain substantial improvement over the existing state-of-the-art. We evaluate our method for separation and recognition of a target speaker in a multi-party scenario. Our results provide compelling evidence of the effectiveness of sparse recovery formulations in speech recognition.

**Background**

**Compressive Sensing** (CS) is sensing via dimensionality reduction.
Dimensionality reduction arises naturally in many problems, so we can leverage
CS theory and algorithms. In theory, CS relies on three premises: (1)
sparse representation, (2) incoherent measurement and (3) a signal recovery
algorithm. In the following, we briefly explain how each of these ingredients
is realized in the BSS-MSR framework.
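
In generic CS notation, the three premises can be written compactly. The symbols below ($y$, $\Phi$, $x$, $M$, $N$, $K$) are chosen here for illustration and are not taken from the paper; the $\ell_1$ program is one textbook recovery formulation, not necessarily the one used in BSS-MSR.

$$
y = \Phi x, \qquad \Phi \in \mathbb{C}^{M \times N}, \quad M < N, \qquad \|x\|_0 \le K \ll N
$$

$$
\hat{x} = \arg\min_{x} \|x\|_1 \quad \text{subject to} \quad y = \Phi x
$$

Here $x$ is the sparse representation, $\Phi$ is the measurement matrix whose columns should be mutually incoherent, and $\hat{x}$ is the output of the recovery algorithm.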

**BSS from compressive measurements**

**Key idea:**
We cast the under-determined speech separation problem as a sparse signal
recovery problem and leverage compressive sensing theory to solve it:

(1) **Spatio-spectral sparse representation:**
we discretize the room into a dense grid in which only a few cells contain
speech activity. We consider the time-frequency (t-f) representation of the
speech signal located at each grid cell. We exploit spatial sparsity in tandem
with spectral sparsity to obtain a sparse representation of the signal in which
the sparse coefficients exhibit a block inter-dependency structure.

(2) **Incoherent measurement:**
we model the room acoustics as a rectangular enclosure consisting of
finite-impedance walls. The point source-to-microphone impulse responses are
calculated using the Image Method. Taking into account the physics of signal
propagation, we construct the measurement matrix associated with the microphone
array using the projections identified by the Green's function of the medium.

(3) **Model-based sparse recovery:** we incorporate the block structure
underlying the sparse coefficients in an efficient model-based sparse recovery
algorithm inspired by the algebra used in Nesterov’s optimal gradient and
optimization techniques, hence called Algebraic Pursuit (ALPS). In our signal
recovery step, a block-sparse signal is approximated by reweighting and
thresholding the energy of the blocks along with a gradient calculation at each
iteration.
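
The recovery step in (3) can be illustrated with a much simpler relative of the algorithm: plain block iterative hard thresholding. This is a minimal sketch and not ALPS itself; it omits the reweighting and the Nesterov-style acceleration, and every name in it (`block_iht`, `keep_top_blocks`, the toy matrix `phi`) is hypothetical. It assumes a real-valued measurement matrix, contiguous blocks of equal size, and a known number of active blocks.

```python
import math

def matvec(phi, x):
    """Compute phi @ x for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in phi]

def rmatvec(phi, r):
    """Compute phi^T @ r (real-valued toy case)."""
    return [sum(phi[i][j] * r[i] for i in range(len(phi)))
            for j in range(len(phi[0]))]

def keep_top_blocks(x, block_size, k_blocks):
    """Zero every block except the k_blocks with the largest energy."""
    n_blocks = len(x) // block_size
    energy = [sum(v * v for v in x[b * block_size:(b + 1) * block_size])
              for b in range(n_blocks)]
    ranked = sorted(range(n_blocks), key=lambda b: energy[b], reverse=True)
    keep = set(ranked[:k_blocks])
    return [v if (j // block_size) in keep else 0.0 for j, v in enumerate(x)]

def block_iht(phi, y, block_size, k_blocks, iterations=200):
    """Alternate a gradient step with block-energy thresholding."""
    # Conservative step size: the squared Frobenius norm upper-bounds
    # the squared spectral norm of phi.
    step = 1.0 / sum(a * a for row in phi for a in row)
    x = [0.0] * len(phi[0])
    for _ in range(iterations):
        residual = [yi - pi for yi, pi in zip(y, matvec(phi, x))]
        gradient = rmatvec(phi, residual)
        moved = [xi + step * gi for xi, gi in zip(x, gradient)]
        x = keep_top_blocks(moved, block_size, k_blocks)
    return x

# Toy problem (assumed values): 3 measurements, 2 candidate blocks of
# size 2, and only the first block is active.
s = 1.0 / math.sqrt(3.0)
phi = [[1.0, 0.0, 0.0, s],
       [0.0, 1.0, 0.0, s],
       [0.0, 0.0, 1.0, s]]
x_true = [1.0, 2.0, 0.0, 0.0]
y = matvec(phi, x_true)
x_hat = block_iht(phi, y, block_size=2, k_blocks=1)
print([round(v, 6) for v in x_hat])
```

In this toy case each block stands in for the coefficients of one grid cell, so selecting a block corresponds to selecting a candidate speaker location. Thresholding on block energy rather than on individual coefficients is what makes the recovery model-based.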

**Experiments**

The following set-up is simulated for stereo recording of convolutive multi-party speech using the Image Method with a room reverberation time of 200 ms. The target speech consists of clean AURORA 2 utterances, while the interfering signals are drawn at random from HTIMIT.
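
To make the simulation idea concrete, here is a minimal sketch of a convolutive stereo mixture built from image sources. It is not the simulator used for these experiments: it keeps only the direct path and the six first-order images (so it does not reproduce a 200 ms reverberation time), and the room size, positions, sample rate, reflection coefficient `beta`, and all function names are assumptions made for illustration.

```python
import math

SPEED_OF_SOUND = 343.0   # m/s (assumed)
SAMPLE_RATE = 8000       # Hz (assumed for this toy example)

def image_sources(src, room, beta):
    """Return (position, gain) for the source and its six first-order images."""
    images = [(tuple(src), 1.0)]
    for axis in range(3):
        for wall in (0.0, room[axis]):
            img = list(src)
            img[axis] = 2.0 * wall - src[axis]   # mirror across this wall
            images.append((tuple(img), beta))
    return images

def impulse_response(src, mic, room, beta=0.7, length=512):
    """Sum delayed, attenuated impulses from the source and its images."""
    h = [0.0] * length
    for pos, gain in image_sources(src, room, beta):
        dist = math.dist(pos, mic)
        delay = int(round(dist * SAMPLE_RATE / SPEED_OF_SOUND))
        if delay < length:
            h[delay] += gain / (4.0 * math.pi * dist)   # spherical spreading
    return h

def convolve(x, h):
    """Direct-form linear convolution of two sequences."""
    out = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            out[i + j] += xi * hj
    return out

def mix(sources, positions, mics, room):
    """Return one convolutive mixture per microphone."""
    mixtures = []
    for mic in mics:
        parts = [convolve(sig, impulse_response(pos, mic, room))
                 for sig, pos in zip(sources, positions)]
        mixtures.append([sum(samples) for samples in zip(*parts)])
    return mixtures

# Three equal-length toy sources in a 3 m x 4 m x 2.5 m room, two microphones.
room = (3.0, 4.0, 2.5)
mics = [(1.4, 2.0, 1.2), (1.6, 2.0, 1.2)]
positions = [(0.5, 1.0, 1.2), (2.5, 1.0, 1.2), (1.5, 3.5, 1.2)]
sources = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 1.0, 0.0, 0.0],
           [0.0, 0.0, 1.0, 0.0]]
stereo = mix(sources, positions, mics, room)
print(len(stereo), len(stereo[0]))
```

With three sources and two microphones the mixture is under-determined, which is the regime the separation method targets. A full Image Method simulation would sum higher-order reflections as well.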

This is an illustration of the speech signals in waveform

You can also listen to a few sample utterances

Stereo Demixing of 3 sources | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 |
---|---|---|---|---|---|
Reference Close Microphone | | | | | |
Reference Distant Microphone | | | | | |
BSS-MSR Demixing Techniques | | | | | |

Stereo Demixing of 5 sources | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 |
---|---|---|---|---|---|
Reference Close Microphone | | | | | |
Reference Distant Microphone | | | | | |
BSS-MSR Demixing Techniques | | | | | |

All experiments are conducted within the framework of the AURORA 2 speech database. The speech recognition word accuracy rates of the separated target speech for the echoic mixtures of 5 sources, together with the relative improvements obtained, are summarized in the following table and bar diagram. BSS-MSR1 refers to the stereo recording and BSS-MSR2 to the 4-channel microphone recording.
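
For reference, word accuracy is conventionally computed from the number of reference words $N$ and the substitution, deletion and insertion counts $S$, $D$, $I$. The page does not state how the relative improvement is defined; the second expression is one common convention and is given here only as an assumption, with "baseline" standing for whichever reference system the comparison uses.

$$
\mathrm{WAcc} = \frac{N - S - D - I}{N} \times 100\%
$$

$$
\text{Relative improvement} = \frac{\mathrm{WAcc}_{\text{BSS-MSR}} - \mathrm{WAcc}_{\text{baseline}}}{\mathrm{WAcc}_{\text{baseline}}} \times 100\%
$$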