Universal Spoken Term Detection with Deep Learning

The overwhelming majority of state-of-the-art ASR systems have followed the same path for about thirty years. The speech signal is first transformed into carefully hand-crafted features; generative models are then used to estimate the likelihood of subword units (typically phonemes); dynamic programming methods are finally applied to recognize the word sequence under various constraints, such as lexical or language model constraints. This type of approach, in several independent steps, has great advantages: for example, we have good a priori knowledge of the type of processing happening in the human ear, and it is natural to design features reproducing this processing. In the same vein, modeling words with phonemes allows the model to capture invariances that would be harder to obtain with speech features alone. Finally, decomposing the problem into several sub-problems also has a great advantage in terms of computational cost.

However, some researchers have already questioned this type of approach: any problem solved in several steps, even carefully chosen ones, is likely to be sub-optimal. Various efforts to find better speech features have been pursued. In certain cases, generative models have been replaced by discriminative approaches, leading to hybrid Neural Network-Hidden Markov Model methods or Conditional Random Field-based methods. Still, a major question remains: is it possible to design an end-to-end speech system? That is, a system trained discriminatively in an end-to-end manner, learning by itself the right features for decoding any sequence of words? While this approach is certainly very challenging and seems to shortcut a lot of current research in speech, we strongly believe computers have reached sufficient processing power to begin research in this new direction.
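To make the final stage of the classic pipeline concrete, here is a minimal, self-contained sketch (not part of the project itself) of the dynamic programming step: a Viterbi search that, given per-frame phoneme log-likelihoods and phoneme transition log-probabilities, recovers the best-scoring phoneme sequence. The toy sizes and probabilities below are illustrative assumptions.

```python
import numpy as np

def viterbi(log_likes, log_trans):
    """Dynamic programming decoding over subword units.

    log_likes: (T, P) frame-wise phoneme log-likelihoods.
    log_trans: (P, P) phoneme transition log-probabilities.
    Returns the most likely phoneme index per frame.
    """
    T, P = log_likes.shape
    score = log_likes[0].copy()          # best score ending in each phoneme
    back = np.zeros((T, P), dtype=int)   # backpointers for path recovery
    for t in range(1, T):
        cand = score[:, None] + log_trans        # (prev, cur) candidate scores
        back[t] = cand.argmax(axis=0)            # best predecessor per phoneme
        score = cand.max(axis=0) + log_likes[t]  # extend by frame likelihood
    # Backtrack from the best final state
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy example: 2 phonemes, 3 frames of (made-up) likelihoods
ll = np.log(np.array([[0.9, 0.1], [0.2, 0.8], [0.3, 0.7]]))
tr = np.log(np.array([[0.7, 0.3], [0.3, 0.7]]))
print(viterbi(ll, tr))  # [0, 1, 1]
```

In a real system the lexical and language model constraints mentioned above are imposed by restricting which transitions are allowed at each step of this search.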
Following research achievements in image and text processing (LeCun et al., 1998; Collobert et al., 2011), we are interested in applying deep learning methods to speech processing. Deep learning algorithms have the ability to learn several layers of features representing the data, with an increasing level of abstraction. They are particularly interesting in the context of speech, as it has been shown with Graph Transformer Networks (Bottou et al., 1997) that such features can be learned through a dynamic programming cost. Our ultimate goal is to show the viability of deep learning methods at several levels of speech processing, building on good a priori knowledge established by the speech community while moving gradually towards an end-to-end system. As a first modest step, we investigate in this project deep learning techniques to develop a novel end-to-end grapheme-based spoken term detection system.
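As a purely hypothetical illustration of the kind of model envisioned (the layer sizes, depth, and output vocabulary are assumptions, not the project's actual architecture), the sketch below shows a deep network mapping raw feature frames directly to per-frame grapheme scores, with no hand-crafted features or phoneme lexicon in between:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def deep_grapheme_scorer(frames, weights, biases):
    """frames: (T, D) raw feature frames; returns (T, G) grapheme scores."""
    h = frames
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)                 # stacked layers: increasing abstraction
    return h @ weights[-1] + biases[-1]     # linear output: one score per grapheme

# Illustrative sizes: 5 frames, 13-dim input, 32 hidden units,
# 27 outputs (26 letters + a blank symbol)
rng = np.random.default_rng(0)
T, D, H, G = 5, 13, 32, 27
weights = [rng.normal(0, 0.1, (D, H)),
           rng.normal(0, 0.1, (H, H)),
           rng.normal(0, 0.1, (H, G))]
biases = [np.zeros(H), np.zeros(H), np.zeros(G)]
scores = deep_grapheme_scorer(rng.normal(size=(T, D)), weights, biases)
print(scores.shape)  # (5, 27)
```

In an end-to-end setting, such per-frame grapheme scores would be fed into a dynamic-programming cost (in the spirit of Graph Transformer Networks) so that the feature layers are trained jointly with the decoding objective.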
Application Area - Exploitation of rich multimedia archives, Machine Learning
Idiap Research Institute
Hasler Stiftung (Hasler Foundation)
Dec 01, 2011
Nov 30, 2014