Documentation for EUSIPCO paper “Speaker Inconsistency Detection in Tampered Video”

Creating databases

So far, the algorithms in this package can be run on two sets of data, generated from the AMI and VidTIMIT databases. Both databases need to be downloaded first. We used the original databases to create corresponding sets of genuine and tampered videos, as well as the evaluation protocols.

VidTIMIT database

From the images in VidTIMIT, we generate video files for the genuine subset and down-sample the audio files to 16 bits. We also generate our own tampered videos (currently, for each video we replace the speech with that of 5 other random speakers). Here are the steps to generate the datasets:

  • Provided you have the VidTIMIT database downloaded to /path/to/vidtimit, generate the genuine audio and video files:
$ bin/python bob/paper/eusipco2018/scripts_vidtimit/generate_non-tampered-audio.py -d /path/to/vidtimit/audio -o /output/dir/where/genuine/files/will/be
$ bin/python bob/paper/eusipco2018/scripts_vidtimit/generate_non-tampered-video.py -d /path/to/vidtimit/video -o /output/dir/where/genuine/files/will/be
  • Generate tampered video (5 tampered for each genuine):
$ bin/python bob/paper/eusipco2018/scripts_vidtimit/generate_tampered.py -d /dir/where/genuine/files/are/ -o /dir/where/tampered/files/will/be/ -t 5

For each genuine video, this script randomly takes audio from 5 other people and creates audio files with the same name, thus producing 5 audio-video pairs where the lip movements do not match the speech.
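
The sketch below illustrates this pairing idea only; it is not the actual generate_tampered.py implementation, and the directory layout, file extensions, and helper name are assumptions made for the example.

# Sketch of the audio-swapping idea behind generate_tampered.py (assumed layout:
# one sub-directory per speaker containing .avi videos and matching .wav files).
import os
import random
import shutil

def make_tampered_pairs(genuine_dir, tampered_dir, num_tampered=5):
    """For each genuine video, copy audio from `num_tampered` other speakers
    under the same base name, forming mismatched audio-video pairs."""
    speakers = [d for d in os.listdir(genuine_dir)
                if os.path.isdir(os.path.join(genuine_dir, d))]
    for speaker in speakers:
        donors = [s for s in speakers if s != speaker]
        speaker_dir = os.path.join(genuine_dir, speaker)
        for video in (f for f in os.listdir(speaker_dir) if f.endswith('.avi')):
            base = os.path.splitext(video)[0]
            for i, donor in enumerate(random.sample(donors, num_tampered)):
                donor_dir = os.path.join(genuine_dir, donor)
                donor_wavs = [f for f in os.listdir(donor_dir) if f.endswith('.wav')]
                out_dir = os.path.join(tampered_dir, speaker, '%s_%02d' % (base, i))
                os.makedirs(out_dir, exist_ok=True)
                shutil.copy(os.path.join(donor_dir, random.choice(donor_wavs)),
                            os.path.join(out_dir, base + '.wav'))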

  • Run face and landmark detection to preprocess the videos (this step is specific to the SGE grid at Idiap):
$ cd bob/paper/eusipco2018/job
$ bash submit_cpm_detection.sh $(find /dir/where/genuine/files/are -name '*.avi')
$ bash watch_jobs.sh /dir/where/genuine/files/are
  • Move the resulting detections to the genuine and tampered directories:
$ rsync --chmod=0777 -avm --include='*.hdf5' -f 'hide,! */' /dir/where/genuine/files/are/ /dir/where/genuine/files/are/
$ bin/python bob/paper/eusipco2018/scripts_vidtimit/reallocate_annotation_files.py -a /dir/where/genuine/files/are -o /dir/where/tampered/files/are

AMI database

Since AMI contains many different types of videos that are not very suitable for lip-sync detection, we need to extract a suitable subset (a single person speaking in the video frame). Using the annotation files provided in the project/savi/data/ami_annotations/ folder, we cut 15-40 second videos from the single-speaker shots and use the audio recorded with the lapel microphone.
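
The generation script in the steps below does this automatically; purely as an illustration of the cutting step, here is a minimal sketch using ffmpeg (assumed to be installed). The function, file names, and timings are hypothetical; the actual cutting is done by generate_non-tampered.py driven by the .mdtm annotations.

# Hypothetical illustration of cutting a single-speaker shot and attaching
# the lapel-mic audio; the real work is done by generate_non-tampered.py.
import subprocess

def cut_single_speaker_segment(video_in, lapel_audio_in, start, duration, video_out):
    """Cut a 15-40 s single-speaker shot and mux it with the lapel-mic audio."""
    subprocess.check_call([
        'ffmpeg', '-y',
        '-ss', str(start), '-t', str(duration), '-i', video_in,        # video shot
        '-ss', str(start), '-t', str(duration), '-i', lapel_audio_in,  # lapel audio
        '-map', '0:v:0', '-map', '1:a:0',  # video from input 0, audio from input 1
        '-c:v', 'copy', '-c:a', 'aac',
        video_out,
    ])

# e.g. cut_single_speaker_segment('ES2002a.Closeup1.avi', 'ES2002a.Lapel-0.wav',
#                                 120.0, 30.0, 'ES2002a_seg01.avi')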

To generate training and development data from AMI, follow these steps:

  • Provided you have the AMI database downloaded to /path/to/ami, you can generate the genuine videos by running the following script:
$ bin/python bob/paper/eusipco2018/scripts_amicorpus/generate_non-tampered.py -d /path/to/ami -a bob/paper/eusipco2018/data/ami_annotations/p1.trn.mdtm -o /output/dir/where/genuine/files/will/be
  • Generate the tampered video set (5 tampered for each genuine) by running the following:
$ bin/python bob/paper/eusipco2018/scripts_amicorpus/generate_tampered.py -d /path/to/ami/genuine/videos -o /output/dir/where/tampered/files/will/be -t 5

For each genuine video, this script randomly takes audio from 5 other people and merges it with the video, thus creating 5 tampered videos where the lip movements do not match the speech.

  • Split the video and audio into separate files (run for both the genuine and tampered directories); see the sketch after this list:
$ bin/python bob/paper/eusipco2018/scripts_amicorpus/bin/extract_audio_from_video.py -d /path/to/ami/videos -o /path/to/ami/videos -p /path/to/ami/videos/
  • The rest of the processing is the same as for VidTIMIT.
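
For reference, the split performed by extract_audio_from_video.py boils down to demuxing the audio track of each video. Below is a minimal sketch of that idea, assuming ffmpeg is available and a mono 16 kHz WAV target (the exact output format used by the package may differ); in practice, use the package's script.

# Sketch of the audio/video split only; use the package's
# extract_audio_from_video.py script in practice. Assumes ffmpeg is installed.
import os
import subprocess

def split_audio(video_path, out_dir):
    """Write the audio track of `video_path` as a mono 16 kHz WAV file."""
    base = os.path.splitext(os.path.basename(video_path))[0]
    wav_path = os.path.join(out_dir, base + '.wav')
    subprocess.check_call([
        'ffmpeg', '-y', '-i', video_path,
        '-vn',                       # drop the video stream
        '-ac', '1', '-ar', '16000',  # mono, 16 kHz (assumed target format)
        wav_path,
    ])
    return wav_path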

Step-by-step instructions for reproducing the experiments

For face and landmark detection, please refer to this README (note that although most of the steps could be replicated on a local machine, the README is written with the SGE grid and the supporting infrastructure available at Idiap in mind).

Before training the models, the video and audio features need to be preprocessed and extracted. First, preprocess the video:

$ bin/train_gmm.py bob/paper/eusipco2018/config/video_extraction_pipeline.py -P oneset-licit -s mfcc20mouthdeltas
$ bin/train_gmm.py bob/paper/eusipco2018/config/video_extraction_pipeline.py -P oneset-spoof -s mfcc20mouthdeltas

Then, use the audio pipeline to extract the audio features (the video features should be ready by then) and train the models (here we use GMMs as an example classifier):

$ bin/train_gmm.py bob/paper/eusipco2018/config/audio_extraction_pipeline.py -P oneset-licit -s mfcc20mouthdeltas --projector-file Projector_gmm_mfcc20_mouthdeltas_licit.hdf5
$ bin/train_gmm.py bob/paper/eusipco2018/config/audio_extraction_pipeline.py -P oneset-spoof -s mfcc20mouthdeltas --projector-file Projector_gmm_mfcc20_mouthdeltas_spoof.hdf5
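
The mfcc20 part of the feature name suggests 20 MFCCs on the audio side (the mouthdeltas part presumably refers to delta features of the detected mouth landmarks on the video side). As a rough stand-in for the audio part only, using librosa rather than the package's own extraction pipeline (dimensions and parameters are assumptions):

# Stand-in illustration of 20-dimensional MFCC extraction;
# not the extractor actually used by the package.
import librosa

def mfcc20(wav_path):
    signal, rate = librosa.load(wav_path, sr=16000)
    return librosa.feature.mfcc(y=signal, sr=rate, n_mfcc=20).T  # (n_frames, 20)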

Test the models and compute scores:

$ bin/spoof.py bob/paper/eusipco2018/config/audio_extraction_pipeline.py -P train_dev -a gmm -s mfcc20mouthdeltas --projector-file Projector_gmm_mfcc20_mouthdeltas_spoof.hdf5
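
The commands above drive the package's own bob-based GMM pipeline. The following is only a conceptual sketch of a two-GMM scoring scheme (genuine vs. tampered models, average log-likelihood ratio per test video), written with scikit-learn and random placeholder features rather than the package's code:

# Conceptual two-GMM scoring sketch with placeholder data; not the package's code.
import numpy as np
from sklearn.mixture import GaussianMixture

genuine_train = np.random.randn(5000, 40)   # per-frame features of genuine videos
tampered_train = np.random.randn(5000, 40)  # per-frame features of tampered videos
test_video = np.random.randn(300, 40)       # per-frame features of one test video

gmm_genuine = GaussianMixture(n_components=64, covariance_type='diag').fit(genuine_train)
gmm_tampered = GaussianMixture(n_components=64, covariance_type='diag').fit(tampered_train)

# Average log-likelihood ratio over the frames of the test video:
# higher values indicate a more consistent (genuine) audio-video pair.
score = gmm_genuine.score(test_video) - gmm_tampered.score(test_video)
print(score)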

User's Guide

Contact

For questions or to report issues with this software package, contact Pavel Korshunov (pavel.korshunov@idiap.ch).