.. vim: set fileencoding=utf-8 :
.. author: Yannick Dayer
.. date: 2020-11-27 15:26:09 +01

.. _bob.pad.base.features:

===================
Advanced features
===================

There are several extra features that we did not discuss yet. In this section,
we explain the database interface, the checkpointing of experiments, and
parallel execution with Dask.

Database interface
==================

All PAD databases must inherit from the :py:class:`bob.pad.base.pipelines.Database`
class and implement the following methods:

- ``database.fit_samples``: returns the samples (or delayed samples) used to
  train the classifier;
- ``database.predict_samples``: returns the samples used to evaluate the
  system. This is where the group (``dev`` or ``eval``) is specified.

The returned samples must have the following attributes:

- ``data``: the data of the sample;
- ``key``: a unique identifier for the sample; must be a string;
- ``attack_type``: the attack type of the sample; must be ``None`` for
  bonafide samples. For attack samples, this indicates the presentation attack
  instrument (PAI) and is used to report error rates per PAI;
- ``subject_id``: the identity of the subject. This may not be available for
  all databases.

A minimal sketch of a custom database implementing these methods is given at
the end of this section.

File list interface
-------------------

A class with those methods returning the corresponding data can be implemented
for each dataset, but an easier way is the *file list* interface. It allows
the creation of multiple protocols and various groups by editing some CSV
files. The :py:class:`bob.pad.base.database.FileListPadDatabase` class, which
builds on :ref:`bob.pipelines.csv_database`, implements this interface.

The dataset configuration file can be as simple as:

.. code-block:: python

    from bob.pad.base.database import FileListPadDatabase

    database = FileListPadDatabase("path/to/my_dataset", "my_protocol")

The files must follow this structure and naming:

.. code-block:: text

    my_dataset
    |
    +-- my_protocol
        |
        +-- train.csv
        +-- dev.csv
        +-- eval.csv

The ``dev.csv`` file is the main file: it is used for scoring samples of the
development group. The ``train.csv`` file is used when a protocol contains
data for training the classifier. The ``eval.csv`` file is optional and is
used when a protocol contains data for evaluation.

Each CSV file must contain at least the path to the raw data (``filename``
field), an identifier of the subject in the image (``subject`` field), and an
attack type (``attack_type`` field). The structure of each CSV file is as
follows:

.. code-block:: text

    filename,subject,attack_type
    path_1,subject_1,
    path_2,subject_2,
    path_3,subject_1,attack_1
    path_4,subject_2,attack_1
    ...

The ``attack_type`` field differentiates bonafide presentations from attacks:
an empty field indicates a bonafide sample. Otherwise, different attack types
(e.g. ``print``, ``replay``) can be used and analyzed separately during the
evaluation.

Metadata (e.g. gender, age, session) can be shipped within the samples by
adding one column per metadata field in the CSV files:

.. code-block:: text

    filename,subject,attack_type,gender,age
    path_1,subject_1,,M,25
    path_2,subject_2,,F,24
    path_3,subject_1,attack_1,M,25
    path_4,subject_2,attack_1,F,24
    ...
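Once the CSV files are in place, the database behaves like any other PAD
database. The following usage sketch assumes the directory layout above exists
on disk; the exact signatures and sample attribute names are best
double-checked in the API documentation:

.. code-block:: python

    from bob.pad.base.database import FileListPadDatabase

    database = FileListPadDatabase("path/to/my_dataset", "my_protocol")

    # Samples used to fit the classifier (read from train.csv).
    train_samples = database.fit_samples()

    # Samples to be scored; the group selects dev.csv or eval.csv.
    dev_samples = database.predict_samples(group="dev")

    for sample in dev_samples:
        # attack_type is None (empty CSV field) for bonafide samples.
        print(sample.key, sample.attack_type)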
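If the file list interface does not fit your dataset, the two methods of the
database interface can be implemented directly, as mentioned above. The
following is a minimal, hypothetical sketch: it assumes that
:py:class:`bob.pad.base.pipelines.Database` can be subclassed by overriding
only these two methods, that samples can be built with
:py:class:`bob.pipelines.Sample`, and that ``load`` stands in for your own
raw-data reader (real implementations typically use delayed samples to defer
loading):

.. code-block:: python

    import numpy as np

    from bob.pad.base.pipelines import Database
    from bob.pipelines import Sample


    def load(path):
        # Hypothetical loader standing in for your raw-data reading code.
        return np.zeros((64, 64))


    class MyDatabase(Database):
        """A toy PAD database with two bonafide samples and one attack."""

        def fit_samples(self):
            # Samples (or delayed samples) used to train the classifier.
            return [
                Sample(load("path_1"), key="path_1", attack_type=None,
                       subject_id="subject_1"),
                Sample(load("path_3"), key="path_3", attack_type="attack_1",
                       subject_id="subject_1"),
            ]

        def predict_samples(self, group="dev"):
            # Samples used to evaluate the system on the given group.
            return [
                Sample(load("path_2"), key="path_2", attack_type=None,
                       subject_id="subject_2"),
            ]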
Checkpoints and Dask
====================

By default, the ``bob pad run-pipeline`` command saves the features of each
step of the pipeline and the fitted estimators in the output folder. To avoid
this, use the ``--memory`` option.

The Dask integration can be enabled by giving a client configuration to the
``--dask-client`` option. Basic Idiap SGE configurations are defined by
bob.pipelines: ``sge`` and ``sge-gpu``::

    $ bob pad run-pipeline --output output_dir --dask-client sge ...

.. note::

   You may want to read the Dask section in
   :ref:`bob.bio.base.pipeline_simple_advanced_features` as well for more
   in-depth information.
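Outside of the Idiap SGE grid, a local Dask cluster can be used instead. The
sketch below is hypothetical: it assumes that ``--dask-client`` also accepts a
Python configuration file defining a ``dask_client`` variable, as the
bob.pipelines configurations do (check the bob.pipelines documentation for the
exact contract):

.. code-block:: python

    # my_dask_client.py - a hypothetical local Dask client configuration.
    from dask.distributed import Client, LocalCluster

    # Run the pipeline on 4 local worker processes instead of the SGE grid.
    cluster = LocalCluster(n_workers=4, threads_per_worker=1)
    dask_client = Client(cluster)

It would then be passed to the command as::

    $ bob pad run-pipeline --output output_dir --dask-client my_dask_client.py ...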