Model-based Sparse Component Analysis

for Multi-party Speech Recognition

 

 

During my PhD program, I was a fellow of the Speech Communication with Adaptive Learning (SCALE) Marie Curie Initial Training Network. The objective of SCALE was to train scientists to work across traditional boundaries in multidisciplinary themes. My research fell under the theme "Bridging the gap between signal processing and machine learning".

My doctoral research proposed and revolved around a new perspective on microphone array recordings as compressive measurements of the acoustic scene, where the measurement matrix can be characterized using the image model of reverberant acoustics. Relying on this formulation, we cast speech localization and separation as sparse recovery of the high-dimensional spatio-spectral representation of the scene. This idea was first evaluated for multiparty speech recognition and received the paper award of the IEEE Spoken Language Processing grant at ICASSP'2011. It was one of the first publications to incorporate the early part of the room impulse response, through the image model, for source localization and separation. On March 21, 2013, the jury committee evaluated the doctoral dissertation as outstanding, solid work of excellent quality and quantity in terms of both theory and experiments, thereby entitling me to a doctoral degree.

 

Abstract

This thesis takes place in the context of multi-microphone distant speech recognition in multiparty meetings. It addresses the fundamental problem of overlapping speech recognition in reverberant rooms. Motivated by the excellent performance of human hearing on this problem, possibly a result of the sparsity of the auditory representation, our work aims to exploit sparse component analysis in the speech recognition front-end to extract the components of the desired speaker from the competing interference (other speakers) prior to recognition. More specifically, speech recovery and recognition are achieved by sparse reconstruction of the (high-dimensional) spatio-spectral information embedded in the acoustic scene from (low-dimensional) compressive recordings provided by a few microphones. This approach exploits the natural parsimonious structure of the data pertaining to the geometry of the problem as well as the information representation space...; [French]; [German]; [List of Contents]

 

 

CHAPTER 1: Introduction

There are missing principles in what has so far been explored and envisioned for the machine listening paradigm. The specific focus of this thesis is on the speech recognition task. Current systems can beat human listeners in clean training conditions, but their performance is very poor in noisy, unexpected conditions [Kollmeier et al., 2008b]. This thesis addresses an open research problem: recognition in the presence of interfering talkers. Current technology breaks down in this scenario; hence, the multiparty condition is where we expect some of the missing premises that ensure robustness to be revealed....

 

 

CHAPTER 2: Multiparty Speech Recovery from Multichannel Recordings

This chapter provides an overview of the state of the art in addressing the problem of speech recovery from the acoustic clutter of interfering voices. We review the fundamental multichannel speech recovery approaches and explain the basic idea of sparse component analysis. Inspired by the sparse coding of sensory information in human perception, we review the general strategy on which computational auditory scene analysis is founded, in order to highlight the principles that we can apply in a framework of sparse signal recovery. This survey puts forward a critical question: does Distant Speech Recognition (DSR) require sparse representation, and could it benefit from sparse component analysis? We recognize the advantages of sparse representation and provide some insights on how a sparse coding framework can lay the foundation of a DSR system robust to overlapping....

 

 

CHAPTER 3: A Compressive Sensing Perspective to Spatio-Spectral Information Recovery

In this chapter, we state the problem of analyzing multichannel recordings in terms of recovering high-dimensional signal information from a few microphone measurements. Our formulation relies on a new perspective: the acquisition of signals by a microphone array is a natural realization of the Compressive Sensing (CS) framework. We review the fundamental CS premises and elaborate on how the CS components are realized in our formulation....
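As a minimal sketch of the compressive sensing idea described above, the following Python example recovers a sparse vector from fewer linear measurements than unknowns using Orthogonal Matching Pursuit. The random Gaussian matrix `Phi`, the dimensions, the chosen support, and the function names `omp` and `demo` are illustrative assumptions. The thesis characterizes the measurement matrix through the image model of the room rather than a random matrix, and its recovery algorithms may differ from this generic greedy method.

```python
import numpy as np


def omp(Phi, y, k):
    """Orthogonal Matching Pursuit: greedy k-sparse solution of y = Phi @ x."""
    n = Phi.shape[1]
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(k):
        # Pick the column most correlated with the current residual.
        correlations = np.abs(Phi.T @ residual)
        if support:
            correlations[support] = 0.0  # never pick the same column twice
        support.append(int(np.argmax(correlations)))
        # Least-squares fit on the selected columns, then update the residual.
        coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        residual = y - Phi[:, support] @ coef
    x_hat = np.zeros(n)
    x_hat[support] = coef
    return x_hat, sorted(support)


def demo(seed=0):
    """Toy experiment: 64 measurements of a 3-sparse vector of length 128."""
    rng = np.random.default_rng(seed)
    m, n = 64, 128  # assumed sizes, chosen only for illustration
    # Stand-in measurement matrix with unit-norm columns (NOT the image model).
    Phi = rng.standard_normal((m, n))
    Phi /= np.linalg.norm(Phi, axis=0)
    x_true = np.zeros(n)
    true_support = [10, 57, 101]  # hypothetical active grid cells
    x_true[true_support] = [2.0, -1.5, 1.0]
    y = Phi @ x_true  # noise-free compressive measurements
    x_hat, support = omp(Phi, y, k=3)
    return x_true, x_hat, true_support, support
```

Calling `demo()` returns the true and estimated vectors together with their supports, so the recovered support can be compared directly with the planted one. In the setting of this chapter, each entry of the sparse vector would correspond to a cell of the discretized acoustic scene.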

 

CHAPTER 4: Structured Sparse Representation

In the previous chapter, we saw that three premises underlie our proposed framework of model-based sparse component analysis, namely structured sparse representation, compressive measurements, and model-based sparse recovery. We now elaborate on the structured sparsity models applicable to speech recovery. The focus of this chapter is on the first building block of our model-based sparse component analysis framework. The chapter investigates the theory and practice of characterizing the structured sparsity of acoustic signals in the form of a spatio-spectral scene. We start with a brief introduction to the theory behind this analysis from the compressive sensing perspective. The structured sparsity models have two distinct aspects, associated with either the perception of sound or its propagation, which are studied in the following sections....
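To make the notion of structured sparsity concrete, the sketch below shows block (group) soft thresholding, the proximal operator of the group norm that is widely used when coefficients are expected to be active or inactive together. Block sparsity is only one common structured sparsity model; whether it matches the specific models studied in this chapter is an assumption, and the function name `group_soft_threshold` and the grouping are hypothetical.

```python
import numpy as np


def group_soft_threshold(x, groups, threshold):
    """Shrink each group of coefficients as a block.

    A group whose Euclidean norm is at most `threshold` is set to zero;
    otherwise the whole group is scaled by (1 - threshold / norm).
    """
    out = np.zeros_like(x, dtype=float)
    for idx in groups:
        block = x[idx]
        norm = np.linalg.norm(block)
        if norm > threshold:
            out[idx] = (1.0 - threshold / norm) * block
    return out


# Hypothetical example: three groups of two coefficients each.
x = np.array([3.0, 4.0, 0.1, -0.2, 0.0, 0.3])
groups = [[0, 1], [2, 3], [4, 5]]
shrunk = group_soft_threshold(x, groups, threshold=1.0)
```

Here the first group has norm 5 and is scaled by 0.8, while the two weak groups are removed entirely, so the result is sparse at the level of groups rather than individual coefficients.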

 

 

CHAPTER 5: Compressive Acoustic Measurement

In Chapter 3, we briefly reviewed how characterizing the acoustic projections amounts to identifying the geometry of the enclosure and the absorption factors of the reflective surfaces. The experiments conducted there assumed that the geometry and the absorption factors were known. This chapter shows how to estimate these parameters. In Chapter 4, the structured sparsity models underlying multipath propagation were elaborated. This chapter deals with the problem of characterizing the compressive acoustic measurements associated with the projection of the high-dimensional acoustic scene data onto the low-dimensional manifold of the microphone array. We exploit the structured sparsity models and propose algorithmic approaches to estimating the geometry of the room and the absorption coefficients from recordings of unknown concurrent speech sources....
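The role of room geometry and absorption can be illustrated with a small sketch of the image model for a rectangular room. It mirrors a source across the four walls of a 2-D room to obtain the first-order image sources, then computes the propagation delay of each path to a microphone. The 2-D simplification, the speed of sound of 343 m/s, the positions, and all function names are assumptions made for illustration; this is the forward model only, not the estimation procedure proposed in the chapter.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def first_order_images(source, room):
    """Mirror a 2-D source across the four walls of a rectangular room.

    The room spans [0, lx] x [0, ly]; `source` is an (x, y) position inside it.
    """
    x, y = source
    lx, ly = room
    return [(-x, y), (2 * lx - x, y), (x, -y), (x, 2 * ly - y)]


def path_delays(source, mic, room):
    """Delays in seconds of the direct path followed by the four reflections."""
    points = [source] + first_order_images(source, room)
    return [math.dist(p, mic) / SPEED_OF_SOUND for p in points]


def reflection_coefficient(absorption):
    """Amplitude reflection coefficient for an energy absorption coefficient."""
    return math.sqrt(1.0 - absorption)


# Hypothetical set-up: a 4 m x 3 m room, one source and one microphone.
images = first_order_images((1.0, 2.0), (4.0, 3.0))
delays = path_delays((1.0, 2.0), (1.0, 0.5), (4.0, 3.0))
```

Each reflected path arrives later than the direct path and is attenuated by the reflection coefficient of the wall it bounces off, which is why the geometry and the absorption factors together determine the acoustic projections.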

 

 

CHAPTER 6: Model-based Sparse Recovery

In Chapter 3, we briefly reviewed the three premises underlying our model-based sparse component analysis framework, namely structured sparse representation, compressive measurements, and model-based sparse recovery. In Chapter 4, we studied the structured sparse representation of the acoustic scene information along with the interdependency models of the sparse coefficients. In Chapter 5, we elaborated on the characterization of the compressive acoustic measurements. This chapter outlines some of the algorithmic approaches to model-based sparse recovery and quantifies the performance of each approach in terms of the accuracy of source localization as well as the quality of the recovered speech....

 

 

CHAPTER 7: Optimum Structured Sparse Coding

The model-based sparse component analysis framework was established in Chapter 3 along with its three fundamental components. The first component is the structured sparse representation, which was elaborated in Chapter 4. The second component is the compressive acoustic measurements, which were characterized in Chapter 5, and the third component is the model-based sparse recovery algorithm, which we studied in Chapter 6. This framework assumed that the geometrical set-up of the microphone array is known in advance. The recent studies presented in Section 6.4.3 demonstrate that conventional microphone arrays are not an optimal design and that sparse recovery techniques yield higher performance with an ad-hoc microphone topology. Hence, in this chapter we generalize our framework by formulating a unified structured sparse coding scheme for source-sensor localization and speech recovery. With the source and sensors localized, we elaborate on the optimality of inverse filtering for speech separation and dereverberation....

 

 

 

CHAPTER 8: Optimum Spatial Filtering

The model-based sparse component analysis framework incorporates the prior information on structured sparsity models (Chapter 4) and the characterization of the acoustic multipath projections (Chapter 5) to obtain the best estimate of the spatio-spectral components matching the microphone array observations. Therefore, the model-based sparse recovery algorithms perform optimization in the observation space. As an alternative to this objective, we can optimize the prediction error of the signal, which has been the fundamental concept of spatial filtering techniques. Hence, the goal of this chapter is to incorporate the prior information on structured sparsity models and multipath projections to yield an optimum beamforming formulation....
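For readers unfamiliar with spatial filtering, the sketch below shows the simplest beamformer, delay-and-sum, as a baseline. It is not the optimum formulation developed in this chapter: it ignores reverberation and sparsity priors, assumes the per-channel delays are known integer numbers of samples, and uses circular shifts for brevity. The function name `delay_and_sum` and the toy signal are hypothetical.

```python
import numpy as np


def delay_and_sum(recordings, delays):
    """Undo each channel's integer sample delay, then average the channels."""
    # np.roll shifts circularly, which is acceptable for this toy example.
    aligned = [np.roll(channel, -delay) for channel, delay in zip(recordings, delays)]
    return np.mean(aligned, axis=0)


# Hypothetical example: one source reaching three microphones with different delays.
rng = np.random.default_rng(0)
source = rng.standard_normal(256)
delays = [0, 3, 7]
recordings = [np.roll(source, d) for d in delays]
output = delay_and_sum(recordings, delays)
```

After alignment the source adds coherently across channels, whereas uncorrelated noise or sound from other directions would add incoherently and be attenuated by the averaging.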

 

 

CHAPTER 9: Concluding Remarks

The present thesis was inspired by two trends: (1) auditory sparse coding and (2) recent algorithmic advances in compressive sensing and sparse signal recovery. We have sought to provide the technical detail and experimental justification for a structured sparse coding approach to multiparty reverberant speech recognition, and we have endeavored to offer some understanding of the mechanism through which sparsity models can be utilized. To conclude, we summarize the key messages of our research and recommend future directions along the lines of these findings....

 

 

Have it All!

 

Last update: September 2013