AVSpoof

Database including 10 types of voice recognition attacks

The AVSpoof database is intended to provide stable, non-biased spoofing attacks so that researchers can test both their ASV systems and their anti-spoofing algorithms. The attacks are created from newly acquired audio recordings. The data acquisition process lasted approximately two months and involved 44 persons, each participating in several sessions configured with different environmental conditions and setups. After the data were collected, the attacks (replay, voice conversion and speech synthesis) were generated.

The data acquisition process is divided into four sessions, each scheduled several days apart with a different setup and different environmental conditions (e.g. background noise, reverberation, etc.) for each of the 31 male and 13 female participants. The first session, which is intended to be used as the training set when creating the attacks, was recorded under the most controlled conditions. The conditions for the last three sessions, which are dedicated to test trials, were more relaxed in order to capture more challenging scenarios. The audio data were recorded with three different devices: (a) one good-quality microphone, AT2020USB+, and two mobile phones, (b) a Samsung Galaxy S4 (phone1) and (c) an iPhone 3GS (phone2).

The positioning of the devices was stabilized for each session and each participant in order to standardize the recording settings.

In each session, the participant followed three different data acquisition protocols:

Reading part (read): Pre-defined sentences are read by the participant (40 in session 1, 10 in each of sessions 2-4).
Pass-phrases part (pass): 5 short prompts are read by the participant.
Free speech part (free): The participant speaks freely about any topic for 3 to 10 minutes.

The number, length and content of the sentences for the reading and pass-phrases parts were carefully selected to satisfy constraints on readability, data acquisition and attack quality. Similarly, the minimum duration of the free speech part was determined from our preliminary investigations, mostly on the voice conversion attacks, for which the free speech data would be included in the training set. Please refer to Table 1 for the statistics of the collected data.

	Session 1	Session 2-4	Total
read	40 sentences	10 sentences	25.96 hours
pass	5 pass-phrases	5 pass-phrases	4.73 hours
free	≥ 5 min.	≥ 3 min.	38.51 hours

Table 1: The statistics of the collected data in terms of session, recording type and acquisition device.
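The per-session prompt counts in Table 1 can be written down as a small lookup table. The sketch below is only an illustration: the names PROTOCOL and prompted_utterances are hypothetical and are not part of the database or its tools. It counts the prompted (read and pass) utterances that one participant produces on one device.

```python
# Prompt counts per session, copied from Table 1.
# All names here are illustrative, not part of the database distribution.
PROTOCOL = {
    1: {"read": 40, "pass": 5},
    2: {"read": 10, "pass": 5},
    3: {"read": 10, "pass": 5},
    4: {"read": 10, "pass": 5},
}


def prompted_utterances(sessions):
    """Total read + pass utterances one participant produces in the given sessions."""
    return sum(PROTOCOL[s]["read"] + PROTOCOL[s]["pass"] for s in sessions)


session1_count = prompted_utterances([1])        # controlled session
later_count = prompted_utterances([2, 3, 4])     # relaxed sessions
print(session1_count, later_count)
```

The free speech part is left out because Table 1 only gives a minimum duration for it, not an utterance count.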

In the spoofing attack creation phase, we created spoofing trials for the text-dependent utterances of the testing data, i.e. the reading parts of sessions 2-4 and the pass-phrases of all four sessions. As a preliminary step before the creation of the attacks, the speech data, originally recorded at a 44.1 kHz sampling rate, were down-sampled to 16 kHz.
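The down-sampling step amounts to a rational rate change. As a minimal sketch (the description does not say which resampling tool or filter was used, and resampled_length is a hypothetical helper), the ratio and the resulting sample count can be worked out with the Python standard library:

```python
from fractions import Fraction
import math

ORIG_RATE = 44_100    # Hz, original recording rate
TARGET_RATE = 16_000  # Hz, rate used for attack creation

# Reduced rational factor: upsample by the numerator, downsample by the denominator.
ratio = Fraction(TARGET_RATE, ORIG_RATE)


def resampled_length(num_samples: int) -> int:
    """Number of output samples after changing the rate by `ratio` (rounded up)."""
    return math.ceil(num_samples * ratio)


print(ratio)                     # 160/441
print(resampled_length(44_100))  # one second of audio -> 16000
```

A polyphase resampler such as scipy.signal.resample_poly(x, 160, 441) takes exactly these two integers as its up and down factors; whether the authors used such a routine is an assumption.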

There are four main spoofing attacks on ASV systems: impersonation, replay, speech synthesis and voice conversion. As impersonation is known not to be a serious threat to ASV systems, we did not include it in our database. For the remaining three spoofing types, we designed ten different scenarios (see Table 2). We gave special attention to physical access attacks. These attacks are more realistic than logical access attacks, considering that the attacker often has no direct access to the system. The acquisition devices (sensors), by contrast, are open to anyone and are therefore more exposed to such attacks.

1. Replay Attacks

A replay attack consists of replaying pre-recorded speech to an ASV system. We assume that the ASV system has a good-quality microphone and that the replay attack targets this sensor. Four different scenarios are considered:

Attacks	Trials per speaker (male)	Trials per speaker (female)	Total trials (male)	Total trials (female)
Replay-phone1	50	50	1,550	650
Replay-phone2	50	50	1,550	650
Replay-laptop	50	50	1,550	650
Replay-laptop-HQ	50	50	1,550	650
Speech-Synthesis-LA	35	35	1,085	455
Speech-Synthesis-PA	35	35	1,085	455
Speech-Synthesis-PA-HQ	35	35	1,085	455
Voice-Conversion-LA	1,500	600	46,500	7,800
Voice-Conversion-PA	1,500	600	46,500	7,800
Voice-Conversion-PA-HQ	1,500	600	46,500	7,800

Table 2: Number of spoofing trials per gender.
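The totals in Table 2 follow from the per-speaker counts and the number of speakers per gender (31 male, 13 female). The sketch below is a cross-check under that reading of the table; the names SPEAKERS, PER_SPEAKER and total_trials are hypothetical.

```python
# Speakers per gender, as stated in the acquisition description.
SPEAKERS = {"male": 31, "female": 13}

# Trials per speaker, copied from Table 2 and grouped by attack type.
REPLAY = ["Replay-phone1", "Replay-phone2", "Replay-laptop", "Replay-laptop-HQ"]
SYNTHESIS = ["Speech-Synthesis-LA", "Speech-Synthesis-PA", "Speech-Synthesis-PA-HQ"]
CONVERSION = ["Voice-Conversion-LA", "Voice-Conversion-PA", "Voice-Conversion-PA-HQ"]

PER_SPEAKER = {}
PER_SPEAKER.update({name: {"male": 50, "female": 50} for name in REPLAY})
PER_SPEAKER.update({name: {"male": 35, "female": 35} for name in SYNTHESIS})
PER_SPEAKER.update({name: {"male": 1500, "female": 600} for name in CONVERSION})


def total_trials(scenario, gender):
    """Total trials for one scenario and gender: per-speaker count times speakers."""
    return PER_SPEAKER[scenario][gender] * SPEAKERS[gender]


for scenario in PER_SPEAKER:
    print(scenario, total_trials(scenario, "male"), total_trials(scenario, "female"))
```

Each printed row should match the corresponding "Total" columns of Table 2, and the dictionary has ten entries, one per scenario.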

Replay-phone1: Replay attack using the data captured by the Samsung mobile. The speech recorded by this mobile is replayed using its own speakers and re-recorded by the microphone of the ASV system.
Replay-phone2: Replay attack using the data captured by the iPhone mobile. The speech recorded by this mobile is replayed using its own speakers and re-recorded by the microphone of the ASV system.
Replay-laptop: Replay attack using the data captured by the microphone of the ASV system. The speech recorded by this microphone is replayed using the laptop speakers and re-recorded again by the microphone of the system.
Replay-laptop-HQ: Replay attack using the data captured by the microphone of the ASV system. The speech recorded by this microphone is replayed using external high-quality loudspeakers and re-recorded using the microphone of the ASV system.

2. Speech Synthesis Attacks

The speech synthesis attacks were based on statistical parametric speech synthesis (SPSS). More specifically, a hidden Markov model (HMM)-based speech synthesis technique was used.

Three scenarios were involved:

Speech-Synthesis-LA: Speech synthesis via logical access. The synthesized speech is directly presented to the ASV system without being re-recorded.
Speech-Synthesis-PA: Speech synthesis via physical access. The synthesized speech is replayed using the laptop speakers and re-recorded by the microphone of the ASV system.
Speech-Synthesis-PA-HQ: Speech synthesis via high-quality physical access. The synthesized speech is replayed using external high-quality loudspeakers and re-recorded by the microphone of the ASV system.

3. Voice Conversion Attacks

The voice conversion attacks were created using Festvox. For each source-target speaker pair, a conversion function is estimated from GMM parameters learned on the training data of the source and target speakers. We did not consider cross-gender voice conversion attacks; that is, only male-to-male and female-to-female conversions were taken into account. As in the case of speech synthesis, three possible scenarios are involved:

Voice-Conversion-LA: Voice conversion via logical access. The converted speech is directly presented to the system without being re-recorded.
Voice-Conversion-PA: Voice conversion via physical access. The converted speech is replayed using the speakers of the laptop and re-recorded by the microphone of the ASV system.
Voice-Conversion-PA-HQ: Voice conversion via high-quality physical access. The converted speech is replayed using external high-quality loudspeakers and re-recorded by the microphone of the ASV system.
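The same-gender restriction on source-target pairs described above can be sketched as a simple enumeration. The speaker identifiers below are made up for illustration (the database uses its own identifiers), and same_gender_pairs is a hypothetical helper, not part of Festvox or the database tools.

```python
from itertools import permutations


def same_gender_pairs(speakers):
    """Ordered (source, target) pairs restricted to speakers of the same gender."""
    by_gender = {}
    for spk_id, gender in speakers:
        by_gender.setdefault(gender, []).append(spk_id)
    return [pair for ids in by_gender.values() for pair in permutations(ids, 2)]


# Hypothetical speaker IDs: 31 male and 13 female participants.
speakers = [(f"m{i:02d}", "male") for i in range(31)]
speakers += [(f"f{i:02d}", "female") for i in range(13)]

pairs = same_gender_pairs(speakers)
sources_per_male_target = sum(1 for _, tgt in pairs if tgt == "m00")
sources_per_female_target = sum(1 for _, tgt in pairs if tgt == "f00")
print(len(pairs), sources_per_male_target, sources_per_female_target)
```

Each male target has 30 possible sources and each female target has 12. If every source contributes 50 converted utterances per target (an inference, not stated in the description), this gives the 1,500 and 600 trials per speaker listed in Table 2.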

If you use this database we kindly ask you to cite the following paper [1]:

References

[1] S. K. Ergünay, E. Khoury, A. Lazaridis, and S. Marcel. On the vulnerability of speaker verification to realistic voice spoofing. In Proc. Int. Conf. on Biometrics: Theory, Applications and Systems (BTAS), 2015.
DOI: 10.1109/BTAS.2015.7358783
http://publications.idiap.ch/index.php/publications/show/3185