Spoken Gigaword

Synthetic Spoken Gigaword dataset



This is the synthetic Spoken Gigaword dataset, one part of the data created for the studies on interpreter-aided spoken language understanding (SLU) in the paper cited below. The full data comprises three parts:

  1. SLURP-Fr, an end-to-end SLU dataset based on the French portion of MASSIVE, containing 16,521 synthetic audio samples created using Google TTS, accompanied by 477 real test samples collected from two French speakers at Idiap.
  2. SLURP-Es, a similar dataset based on the parallel Spanish portion of MASSIVE, containing only synthetic samples.
  3. Spoken Gigaword, a speech summarization dataset generated from Gigaword, containing 51,385 synthetic audio samples created using Google TTS.



If you use this dataset, please cite the following publication:

He, Mutian, and Philip N. Garner. "The Interpreter Understands Your Meaning: End-to-end Spoken Language Understanding Aided by Speech Translation." Findings of EMNLP 2023.