This dataset contains i-vectors of natural and synthetic speech utterances for spoofing and anti-spoofing research purposes. Initial experiments with it are presented in the paper:

    @article{Sizov2015,
      author = {Sizov, A. and Khoury, E. and Kinnunen, T. and Wu, Z. and Marcel, S.},
      title = {Joint Speaker Verification and Anti-Spoofing in the i-Vector Space},
      journal = {IEEE Trans. on Information Forensics and Security},
      year = {2015},
      url = {http://cs.uef.fi/~sizov/pdf/TIFS2015_joint.pdf},
    }

Natural speech for training
---------------------------

We used utterances from NIST SRE04, SRE05 and SRE06 corpora for training purposes (see details in the table below).


Synthetic speech for training
-----------------------------

We used the SPTK toolkit (http://sp-tk.sourceforge.net/) to perform MCEP (Mel-cepstral) and LPC (linear predictive coding) analysis and synthesis. A copy-synthesis approach is employed to generate the MCEP- and LPC-coded speech for training the spoofing detector without applying any specific voice conversion (VC) technique.

That is, we first decompose a speech signal into its Mel-cepstral (or LPC) and fundamental frequency (F0) parameters and then reconstruct an approximated signal directly from these parameters. The reconstructed replica is thus a version of the original signal passed through an analysis-synthesis channel; in general it is close to the original but not identical, owing to the lossy analysis-synthesis model, and perceptually a buzzy or muffled voice quality can be observed. Such copy-synthesis is a straightforward way to generate training samples for spoofing detection without involving the computationally demanding stochastic VC step, which would additionally require the selection of source-target speaker pairs and a parallel training set.
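
To make the copy-synthesis idea concrete, here is a minimal, self-contained LPC analysis-resynthesis sketch in numpy. It is only an illustration of the lossy analysis-synthesis principle (with a toy noise-driven excitation), not the actual SPTK pipeline used to build this dataset; all function names are ours:

```python
import numpy as np

def lpc(frame, order):
    """Estimate LPC predictor coefficients via the autocorrelation method."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    # small ridge keeps the Toeplitz system well-conditioned
    a = np.linalg.solve(R + 1e-6 * r[0] * np.eye(order), r[1:order + 1])
    return a  # x[n] ~= sum_k a[k] * x[n-1-k]

def copy_synthesis(x, order=12, frame_len=160):
    """Toy copy-synthesis: re-generate each frame from its own LPC model,
    driving the all-pole filter with white noise of matched residual energy."""
    rng = np.random.default_rng(0)
    y = np.zeros_like(x)
    for start in range(0, len(x) - frame_len + 1, frame_len):
        frame = x[start:start + frame_len]
        a = lpc(frame, order)
        # prediction residual energy sets the excitation gain
        pred = np.zeros(frame_len)
        for n in range(order, frame_len):
            pred[n] = a @ frame[n - 1::-1][:order]
        gain = np.sqrt(np.mean((frame[order:] - pred[order:]) ** 2))
        # resynthesis: all-pole filter driven by scaled white noise
        e = gain * rng.standard_normal(frame_len)
        out = np.zeros(frame_len)
        for n in range(frame_len):
            past = out[max(0, n - order):n][::-1]
            out[n] = e[n] + a[:len(past)] @ past
        y[start:start + frame_len] = out
    return y
```

The reconstruction is deliberately lossy: only the all-pole envelope and residual energy survive, which is what makes copy-synthesized speech a useful stand-in for spoofed speech in training.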

The copy-synthesis speech of SRE04, SRE05 and SRE06 is generated for both MCEP and LPC.
We refer to the generated corpus as the ``synthetic" training set, in contrast to the original ``natural" training set.


Synthetic speech for testing
----------------------------

We employed the spoofing attack dataset designed in:

Kinnunen, T. et al. "Vulnerability of speaker verification systems against voice conversion spoofing attacks: The case of telephone speech", ICASSP 2012.
Wu, Z. et al. "A study on spoofing attacks in state-of-the-art speaker verification: the telephone speech case", APSIPA ASC 2012.

It is based on the core task ``1conv4w-1conv4w" of the Speaker Recognition Evaluation 2006 (SRE06) corpus, which is a widely used standard benchmark database for text-independent speaker verification research. We consider two different voice conversion methods: the popular joint-density Gaussian mixture model (JD-GMM) based method and a simplified frame selection (FS) method.

In the JD-GMM conversion, we consider two feature representations, namely Mel-cepstral analysis based features (MCEP) and linear predictive coding based features (LPC), while in the FS conversion, only MCEP features are considered. The difference between JD-GMM and FS conversion is that JD-GMM modifies the source features to match those of the target speaker, while FS uses the target speaker's features directly to generate the converted speech.
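
The frame-selection idea can be sketched as follows. This is a simplified illustration using plain Euclidean nearest-neighbour search, not the exact selection procedure of the papers cited above:

```python
import numpy as np

def frame_selection(source, target):
    """For each source frame, pick the closest target-speaker frame
    (Euclidean distance) and use it verbatim -- no statistical mapping
    of the features is performed, unlike in JD-GMM conversion."""
    # pairwise squared distances between source and target frames
    d = ((source[:, None, :] - target[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)  # index of nearest target frame per source frame
    return target[idx]      # converted feature sequence, built from real target frames
```

Because every output frame is an unmodified target-speaker frame, FS-converted speech carries genuine target spectral detail, which makes it a distinct attack type from JD-GMM conversion.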


i-vector extraction details
---------------------------

The full experiments were carried out using the open-source speaker recognition toolbox (https://pypi.python.org/pypi/xspear.fast_plda), which is a modification of the Spear toolbox (https://pypi.python.org/pypi/bob.spear). Acoustic features are extracted at equally-spaced time instants using a sliding-window approach. First, a simple energy-based voice activity detection (VAD) is performed to discard the non-speech parts. Second, 19 MFCCs and the log energy, together with their first- and second-order derivatives, are computed over 20 ms Hamming-windowed frames every 10 ms. Finally, utterance-level cepstral mean and variance normalization (CMVN) is applied to the resulting 60-dimensional feature vectors.
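
The delta and CMVN steps can be sketched as below. This is a minimal numpy illustration, not the toolbox's implementation; the MFCC extraction and VAD are omitted, and the function names and delta regression width are our own choices:

```python
import numpy as np

def deltas(feats, width=2):
    """First-order delta features by linear regression over +/- width frames."""
    T = len(feats)
    padded = np.pad(feats, ((width, width), (0, 0)), mode="edge")
    num = sum(k * (padded[width + k:width + k + T] - padded[width - k:width - k + T])
              for k in range(1, width + 1))
    den = 2 * sum(k * k for k in range(1, width + 1))
    return num / den

def cmvn(feats):
    """Utterance-level cepstral mean and variance normalization."""
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-10)

def stack_features(static):
    """Static features (e.g. 19 MFCCs + log energy, 20-dim) ->
    60-dim vectors with first- and second-order deltas, then CMVN."""
    d1 = deltas(static)     # first-order derivatives
    d2 = deltas(d1)         # second-order derivatives
    return cmvn(np.hstack([static, d1, d2]))
```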

After feature extraction, the training of the UBM and the T subspace is done using Fisher, Switchboard, SRE04, SRE05 and SRE06 corpora (from which the enrolment and test data used in our experiments were excluded). The UBM model is composed of 2048 Gaussian components and the rank of the total variability matrix T is set to 600.
It is worth noting that both natural and synthetic speech utterances undergo exactly the same feature and i-vector extraction procedure.
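
For reference, the i-vector point estimate follows the standard posterior-mean formula given the Baum-Welch statistics. The sketch below is a textbook illustration assuming diagonal UBM covariances, not the toolbox's actual implementation; the variable names are ours:

```python
import numpy as np

def ivector(T_mat, Sigma_inv, N, F):
    """Posterior-mean i-vector from Baum-Welch statistics.

    T_mat     : (C*D, R) total variability matrix (R = 600 in this dataset)
    Sigma_inv : (C*D,)   inverse diagonal UBM covariances, stacked
    N         : (C,)     zeroth-order stats (per-component occupancies)
    F         : (C*D,)   centered first-order stats, stacked
    """
    C, R = len(N), T_mat.shape[1]
    D = len(F) // C
    N_big = np.repeat(N, D)  # expand occupancies to the stacked C*D layout
    # posterior precision: I + T' diag(N) Sigma^-1 T
    precision = np.eye(R) + T_mat.T @ ((N_big * Sigma_inv)[:, None] * T_mat)
    # posterior mean: precision^-1 T' Sigma^-1 F
    return np.linalg.solve(precision, T_mat.T @ (Sigma_inv * F))
```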


i-vector file format
--------------------
i-vectors are stored in the hdf5 format in the following way:
	Dataset: /ivec
	Size: 600
	Datatype: H5T_IEEE_F64LE (double)
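
Assuming the files follow the layout above, an i-vector can be written and read back with the h5py library as follows (the file name here is hypothetical):

```python
import os
import tempfile

import h5py
import numpy as np

# write a toy 600-dimensional i-vector in the layout described above
w = np.random.default_rng(0).standard_normal(600)
path = os.path.join(tempfile.mkdtemp(), "example_ivec.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("ivec", data=w, dtype="float64")  # stored as H5T_IEEE_F64LE

# read it back
with h5py.File(path, "r") as f:
    ivec = f["ivec"][...]  # numpy array of shape (600,)
```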
	

Database statistics
-------------------

                            Female                   Male
Training data                                       
    -natural                16192 utt./ 941 spk.     12372 utt./ 614 spk.
    -synthetic              32384 utt./ 1882 spk.    24744 utt./ 1228 spk.

Enrollment data             342 utt./ 342 spk.       241 utt./ 241 spk.
                                                                         
Testing data                                            
    -target trials          2332                     1614
    -impostor trials        6460                     4528
          
        -zero-effort        1615                     1132
        -spoof (MCEP)       1615                     1132
        -spoof (LPC)        1615                     1132
        -spoof (FS)         1615                     1132              


