Parsimonious Hierarchical Automatic Speech Recognition and Query Detection

This project proposal is intended to merge and extend the support for two ongoing and strongly complementary projects, "Parsimonious Hierarchical Automatic Speech Recognition" (PHASER, 200021-153507) and "Adaptive Multilingual Speech Processing" (A-MUSE, 200020-144281), covering the last two years of the PhD students funded on them, Pranay Dighe (PHASER) and Dhananjay Ram (A-MUSE), as well as Dr. Afsaneh Asaei (PHASER), one of the leading postdocs in the field, who will assist in supervising the PhD students. After a brief overview of the achievements of the last two years of the two projects, we describe in detail the research activities foreseen to further anchor the novel parsimonious and hierarchical paradigm for the closely related tasks of speech recognition and query detection (hence the project acronym PHASERQUAD).

The goal of this project is to exploit and integrate, in a principled way, recent developments in posterior-based Automatic Speech Recognition (ASR) and Query-by-Example Spoken Term Detection (QbE-STD): hybrid HMM/ANN systems combining Hidden Markov Models (HMMs) and Artificial Neural Networks (ANNs), Deep Neural Networks (a particular form of ANN with a deep/hierarchical, nonlinear architecture), compressive sensing, subspace modeling, and hierarchical sparse coding for ASR and QbE-STD. The framework we have been building on quite successfully relies on strong relationships between standard HMM techniques (with HMM states as latent variables) and the standard compressive sensing formalism, where the atoms of the compressive dictionary are directly related to posterior distributions of HMM states. The proposed research thus takes a new perspective on speech acoustic modeling as a sparse recovery problem, which takes low-dimensional observations (at the rate of acoustic features) and provides a high-dimensional sparse inference (at the rate of words), while preserving linguistic information as well as temporal and lexical constraints. To that end, we have proposed a novel paradigm for speech recognition and spoken query detection based on sparse subspace modeling of posterior exemplars (a minimal illustrative sketch of this formulation follows this summary).

To further develop these hierarchical, sparsity-based ASR and QbE-STD systems, and to demonstrate their potential on the hardest benchmark tasks, several challenging problems will be addressed over the next two years (resulting in two distinct high-quality PhDs): (1) sparse posterior modeling tailored to speech recognition and detection objectives, (2) exploitation of the low-dimensional structure of the posterior space for unsupervised adaptation to unseen acoustic conditions, and (3) hierarchical structured architectures that can go beyond the topological constraints of HMMs for high-level linguistic inference. Exploiting and further developing our various state-of-the-art speech processing tools (often available as Idiap open source or integrated in other systems such as Kaldi), the resulting systems will be evaluated on three different, very challenging databases: GlobalPhone (multilingual), AMI (noisy and conversational accented speech), and MediaEval for QbE-STD.
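The sketch below is a minimal, purely illustrative example of the sparse recovery formulation outlined above, not the project's actual implementation: a test posterior frame is decomposed over a dictionary whose atoms are exemplar posterior vectors, and the sparse support indicates which exemplars best explain the observation. The posterior dimensionality, dictionary size, synthetic Dirichlet data, and the use of scikit-learn's Lasso solver are all assumptions made for this example only.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

n_states = 40      # dimension of HMM-state posterior vectors (illustrative assumption)
n_exemplars = 500  # number of exemplar posteriors collected in the dictionary (assumption)

# Dictionary D: columns are exemplar posterior vectors (in practice taken from
# a trained DNN/HMM system); here simulated as points on the probability simplex.
D = rng.dirichlet(np.ones(n_states), size=n_exemplars).T   # shape (n_states, n_exemplars)

# Test observation x: the posterior vector of one frame of a query or utterance.
x = rng.dirichlet(np.ones(n_states))

# Sparse recovery: find a sparse, non-negative code z such that x ~= D z.
# The support of z points to the exemplars (and hence classes/words) that
# best explain the observed frame.
lasso = Lasso(alpha=1e-3, positive=True, max_iter=10000)
lasso.fit(D, x)
z = lasso.coef_

support = np.flatnonzero(z)
print(f"{support.size} active exemplars out of {n_exemplars}")
print("largest coefficients at exemplars:", np.argsort(z)[-5:][::-1])

In the proposed work, such codes are computed over posterior exemplars with temporal and lexical structure rather than synthetic data; the sketch only shows the basic low-dimensional-observation to high-dimensional-sparse-code mapping.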
Idiap Research Institute
Swiss National Science Foundation
Oct 01, 2016
Sep 30, 2019