Sparse and Hierarchical Structures for Speech Modeling

Computer science is currently witnessing the emergence of major research activities, arising (more or less independently) from multiple disciplines (statistics, linear algebra, human brain research) and being applied in different ways in multiple application areas, including big data mining, statistical pattern recognition, computer vision, and speech processing (synthesis and recognition). These emerging research areas, which will be investigated in the context of SHISSM, include (1) an increasing focus on posterior-based features and systems; (2) the revival of (brain-inspired) Artificial Neural Networks in the form of deep/hierarchical architectures (referred to as Deep Neural Networks, DNNs); (3) full exploitation of modern compute resources (big data, large GPU-based processing); (4) sparse coding, seeking sparse representations of the processed signals; and (5) compressive sensing and sparse recovery, aiming at modeling the speech signal in high-dimensional sparse spaces (reminiscent of what is believed to happen in the human brain), resulting in simpler processing (e.g., recognition) algorithms. Building upon several predecessor projects, which resulted in strong theoretical and experimental outcomes, SHISSM is an ambitious project that aims at developing a better theoretical understanding allowing for a principled combination of ‘deep’ (hierarchical) and ‘sparse’ architectures, driven by advanced statistical modeling (posterior-based approaches, as estimated at the DNN outputs), compressive sensing, sparse recovery, and the formal links recently identified by the PI of this project between Hidden Markov Models (HMMs) and compressive sensing, also exploiting posterior distributions estimated by DNNs. SHISSM is thus an interdisciplinary project tying together these important emerging areas in the context of speech modeling, and Automatic Speech Recognition (ASR) in particular, although its impact is expected to go far beyond speech.
Ideally, the targeted framework should result in a unified model that is more performant on complex pattern recognition tasks, while also providing interesting biological motivations. In the particular context of speech modeling, and as already demonstrated through some of the PI’s work (discussed later), the resulting approach should be more efficient and more performant than standard HMMs (or the hybrid HMM/DNN systems currently considered state-of-the-art), while also being more biologically sound and relevant to other related areas such as speech synthesis, speech coding, and human speech intelligibility modeling. This project proposal is intended to consolidate and extend the efforts currently being developed in the context of the Swiss NSF project PHASER QUAD (2 years, 200020 169398), supporting two excellent PhD students (Pranay Dighe and Dhananjay Ram, who should have nearly defended their PhD theses by the start of the present project). The project will exploit recent developments in the above research areas and further develop our various state-of-the-art machine learning and speech processing tools (often available as Idiap open source libraries, as well as through the GitHub site, or integrated in other open source toolkits like Kaldi). The resulting systems will be evaluated on several international benchmark databases, including, depending on the type of research, TIMIT (phones), Phonebook (words), Switchboard (sentences), AMI (noisy and conversational accented speech), GlobalPhone (multilingual), and ITU subjectively scored data (for human auditory modeling).
Idiap Research Institute
Swiss National Science Foundation
Jan 01, 2018
Dec 31, 2021