PHASER: Parsimonious Hierarchical Automatic Speech Recognition

The present project aims to exploit and integrate, in a principled way, recent developments in posterior-based speech recognition systems; hybrid HMM/ANN systems, which combine Hidden Markov Models (HMM) and Artificial Neural Networks (ANN); Deep Neural Networks (DNN, a particular form of ANN with a deep, hierarchical, nonlinear architecture); compressive sensing; sparse modeling; and hierarchical sparse coding for automatic speech recognition (ASR). The resulting ASR framework should draw on multiple current trends in speech processing, including concepts from diverse fields such as signal processing, compressive sensing, and machine learning (HMM, DNN, etc.). Besides further research and development in these areas, one of the key pivots of the present proposal is the recently investigated strong relationship between standard HMM techniques (with HMM states as latent variables) and the standard compressive sensing formalism, where the atoms of the compressive dictionary are directly related to posterior distributions of HMM states.

The proposed research thus takes a new perspective on speech recognition as a sparse recovery problem, which takes low-dimensional observations (at the rate of acoustic features) and provides a high-dimensional sparse inference (at the rate of words) while preserving the linguistic information. The resulting model should capture temporal properties while exploiting model parsimony and hierarchical structures, and should also integrate the phonetic and lexical constraints currently modeled through the pre-defined HMM topology. To tackle this very challenging problem, the project will have to address multiple issues related to (1) sparse features for ASR (going beyond spectrographic speech and standard discriminative training), (2) sparse-based and statistical (HMM)-based ASR (exploiting a new type of HMM, referred to as KL-HMM, and developing theoretical relationships between HMM and sparse modeling-based ASR), and (3) hierarchical structured sparse ASR (going beyond standard sub-word and lexicon definition and representation, and going beyond HMM, while preserving structural and temporal constraints). Exploiting and further developing our various state-of-the-art ASR tools (often available as Idiap open source or integrated in other systems such as Kaldi), the resulting systems will be evaluated on three different databases: TIMIT for phone accuracy, Phonebook for isolated word accuracy, and Switchboard for continuous speech recognition.
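To make the sparse recovery view concrete, the following is a minimal sketch of generic sparse recovery, not the project's own method. It recovers a sparse, high-dimensional vector from a lower-dimensional observation using orthogonal matching pursuit. The dictionary (identity plus normalized Hadamard atoms), the function name `omp`, and the toy signal are all illustrative assumptions; in the setting described above, the dictionary atoms would instead relate to posterior distributions of HMM states.

```python
import numpy as np


def omp(A, y, k):
    """Orthogonal matching pursuit: estimate a k-sparse x such that y ≈ A @ x."""
    n = A.shape[1]
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(k):
        # Select the atom most correlated with the current residual.
        correlations = np.abs(A.T @ residual)
        if support:
            correlations[support] = 0.0  # never select the same atom twice
        support.append(int(np.argmax(correlations)))
        # Re-fit all selected atoms by least squares, then update the residual.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x = np.zeros(n)
    x[support] = coef
    return x


# Toy dictionary (an assumption for illustration): 16-dimensional observations,
# 32 unit-norm atoms built from the identity and a normalized Hadamard matrix.
H2 = np.array([[1.0, 1.0], [1.0, -1.0]])
H16 = np.kron(np.kron(H2, H2), np.kron(H2, H2))
A = np.hstack([np.eye(16), H16 / 4.0])

# A 2-sparse "high-dimensional" vector and its low-dimensional observation.
x_true = np.zeros(32)
x_true[3] = 1.5
x_true[20] = -2.0
y = A @ x_true

x_hat = omp(A, y, k=2)
print(np.flatnonzero(x_hat))  # indices of the atoms selected by OMP
```

This particular dictionary has low mutual coherence, which is why a greedy method suffices for a 2-sparse signal here; a learned or posterior-based dictionary would generally need its own analysis.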

Information Interfaces and Presentation
Swiss National Science Foundation
Jun 01, 2014
May 31, 2016