The following presents a summary of most of my research, with links to relevant articles for each topic (see my publication list for citation details and collaborators in each case).
Much of my research has been done in the context of using microphone arrays as an input device for subsequent speech recognition or speaker recognition. As the cost and processing complexity of arrays increase with the number of microphone elements, I have focussed on achieving good performance using small arrays. I found that it is often beneficial to use a post-filtering stage following the beamformer to enhance the signal-to-noise ratio and significantly improve the speech recognition accuracy. I also formulated a new microphone array post-filter that takes a model of the noise field into consideration, further improving performance in diffuse noise fields, such as office environments.
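As a rough illustration of what a noise field model can look like, the sketch below computes the coherence between two omnidirectional microphones in an ideal diffuse (spherically isotropic) noise field, which is commonly modelled as a sinc function of frequency and microphone spacing. This is a textbook model shown under stated assumptions, not necessarily the exact formulation used in the post-filter described above; the function name, the default speed of sound and the example spacing are illustrative choices.

```python
import numpy as np

def diffuse_coherence(freq_hz, spacing_m, speed_of_sound=343.0):
    """Coherence of an ideal diffuse noise field between two microphones.

    Assumed model: sin(2*pi*f*d/c) / (2*pi*f*d/c).
    np.sinc(x) is sin(pi*x)/(pi*x), so the argument passed is 2*f*d/c.
    """
    freq_hz = np.asarray(freq_hz, dtype=float)
    return np.sinc(2.0 * freq_hz * spacing_m / speed_of_sound)

# Example: coherence across frequency for a hypothetical 10 cm spacing.
freqs = np.linspace(0.0, 4000.0, 5)
coherence = diffuse_coherence(freqs, spacing_m=0.1)
```

Under this model the noise is highly coherent between channels at low frequencies and for closely spaced microphones, which is one reason small arrays benefit from a post-filter that accounts for the noise field rather than assuming uncorrelated noise.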
One of my early research interests was integrating the microphone array processing more closely with the subsequent speech recognition system. As arrays have the ability to spatially distinguish between sound sources, they can be used to estimate the noise as well as the desired speech signal. I used array-based noise estimation in conjunction with a few robust speech recognition methods, including sub-band speech recognition, missing feature recognition and feature-based spectral subtraction.
During my years at IDIAP, I also worked on several projects that used microphone arrays as a speech acquisition device for meetings or general group conversations. One issue in processing such speech is the need to separate and recognise speech when there is overlap between speakers. In trying to resolve this issue, I worked on databases that we constructed to focus on the overlapping speech condition, first for a small vocabulary task (Numbers) and then for a medium vocabulary task (WSJ). A further challenge in a multi-speaker conversation is to locate and track each person so that the beamformer can be correctly steered towards them. I worked on the integration of an audio-visual speaker tracker with a beamformer as a robust solution to this problem.
When microphone array techniques are applied as a pre-processing stage for recognising meeting speech, the required information on microphone placement is often unreliable or unavailable. A simple solution used in the AMI MDM (multiple distant microphone) system is to directly estimate the optimal time delays between each channel and use these to blindly steer the beamformer. In efforts to improve on this simple approach, I proposed a fully automated method for calibrating microphone positions by exploiting a model of the background noise, and studied at what point the errors in microphone positions start to impact the speech recognition performance.
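The idea of blind steering from estimated time delays can be sketched as follows. This is a minimal illustration and not the AMI MDM implementation: it assumes integer-sample delays, uses plain cross-correlation against the first channel as the delay estimator, and aligns channels with a circular shift. The function names and the `max_lag` parameter are hypothetical.

```python
import numpy as np

def estimate_delay(ref, sig, max_lag):
    """Return the integer lag (in samples) by which sig trails ref.

    Uses FFT-based cross-correlation, zero-padded so that the
    correlation is linear rather than circular. Requires max_lag >= 1.
    """
    n = len(ref) + len(sig)
    spectrum = np.fft.rfft(sig, n) * np.conj(np.fft.rfft(ref, n))
    cc = np.fft.irfft(spectrum, n)
    # Non-negative lags sit at the start of cc, negative lags at the end.
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    lags = np.arange(-max_lag, max_lag + 1)
    return int(lags[np.argmax(cc)])

def delay_and_sum(channels, max_lag=64):
    """Align every channel to the first one and average them."""
    channels = np.asarray(channels, dtype=float)
    ref = channels[0]
    delays = [estimate_delay(ref, ch, max_lag) for ch in channels]
    # np.roll wraps samples around the edges; acceptable for a sketch,
    # but a real system would pad or discard the edge samples.
    aligned = [np.roll(ch, -d) for ch, d in zip(channels, delays)]
    return np.mean(aligned, axis=0), delays

# Example with a synthetic source delayed differently on each channel.
rng = np.random.default_rng(0)
source = rng.standard_normal(1000)
mics = [source, np.roll(source, 3), np.roll(source, -5)]
output, delays = delay_and_sum(mics, max_lag=16)
```

No knowledge of the array geometry is needed here, which is what makes the approach attractive when microphone positions are unknown. The trade-off is that the delays are estimated from the data alone, so they degrade in noisy or reverberant conditions.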
Beyond the direct speech enhancement or recognition applications, I have also been interested in modelling low-level audio-visual cues in order to recognise higher-level information or events, such as meeting phases, individual actions, or the interest level of a group. Such techniques have potential application to evaluating team communication in small discussion groups, for instance in the health domain.
I have also worked in the health informatics field to apply natural language processing and machine learning to automate the collection of cancer staging data from free-text pathology reports.
