.. _bob.pipelines.csv_database:

File List Databases (CSV)
=========================

We saw in :ref:`bob.pipelines.sample` how using samples can improve the
workflow of our machine learning experiments. However, we did not discuss how
to create the samples in the first place.

In all reproducible machine learning experiments, each database comes with one
or several protocols that define exactly which files should be used for
training, development, and testing. These protocols can be defined in ``.csv``
files where each row represents a sample. Using ``.csv`` files to define the
protocols of a database is advantageous because such files are easy to create
and read, and they can be imported and used in many different libraries.

Here, we provide :any:`bob.pipelines.FileListDatabase`, which reads ``.csv``
files and generates :py:class:`bob.pipelines.Sample` objects. The format is
extremely simple. You must put all the protocol files in a folder with the
following structure::

    dataset_protocols_path/<protocol>/<group>.csv

where each subfolder points to a specific *protocol* and each file contains
the samples of a specific *group* or *set* (e.g. the training set). The names
of the protocols are the names of the folders, and the name of each group is
the name of the ``.csv`` file.

.. note::

    Instead of pointing to a folder, you can also point to a compressed
    tarball that contains the protocol files.

The ``.csv`` files must have the following structure::

    attribute_1,attribute_2,...,attribute_n
    sample_1_attribute_1,sample_1_attribute_2,...,sample_1_attribute_n
    sample_2_attribute_1,sample_2_attribute_2,...,sample_2_attribute_n
    ...
    sample_n_attribute_1,sample_n_attribute_2,...,sample_n_attribute_n

Each row contains exactly **one** sample (e.g. one image) and each column
represents one attribute of the samples (e.g. the path to the data or other
metadata).

An Example
----------

Below is an example of creating the iris database. The ``.csv`` files
distributed with this package are organized as follows::

    iris_database/
        default/
            train.csv
            test.csv

As you can see, there is only one protocol, called ``default``, and two
groups, ``train`` and ``test``. Moreover, the ``.csv`` files have the
following format::

    sepal_length,sepal_width,petal_length,petal_width,target
    5.1,3.5,1.4,0.2,Iris-setosa
    4.9,3,1.4,0.2,Iris-setosa
    ...

.. doctest:: csv_iris_database

    >>> import pkg_resources
    >>> import bob.pipelines as mario
    >>> dataset_protocols_path = pkg_resources.resource_filename(
    ...     'bob.pipelines', 'tests/data/iris_database')
    >>> database = mario.FileListDatabase(
    ...     dataset_protocols_path,
    ...     protocol="default",
    ... )
    >>> database.samples(groups="train")
    [Sample(data=None, sepal_length='5.1', sepal_width='3.5', petal_length='1.4', petal_width='0.2', target='Iris-setosa'), Sample(...)]
    >>> database.samples(groups="test")
    [Sample(data=None, sepal_length='5', sepal_width='3', petal_length='1.6', petal_width='0.2', target='Iris-setosa'), Sample(...)]

As you can see, all attributes are loaded as strings. Usually, we want to
*transform* our samples further before using them.

Transforming Samples
--------------------

:any:`bob.pipelines.FileListDatabase` accepts a transformer that will be
applied to all samples:

.. doctest:: csv_iris_database

    >>> import numpy as np
    >>> from sklearn.preprocessing import FunctionTransformer

    >>> def prepare_data(sample):
    ...     return np.array(
    ...         [sample.sepal_length, sample.sepal_width,
    ...          sample.petal_length, sample.petal_width],
    ...         dtype=float
    ...     )

    >>> def prepare_iris_samples(samples):
    ...     return [mario.Sample(prepare_data(sample), parent=sample) for sample in samples]

    >>> database = mario.FileListDatabase(
    ...     dataset_protocols_path,
    ...     protocol="default",
    ...     transformer=FunctionTransformer(prepare_iris_samples),
    ... )
    >>> database.samples(groups="train")
    [Sample(data=array([5.1, 3.5, 1.4, 0.2]), sepal_length='5.1', sepal_width='3.5', petal_length='1.4', petal_width='0.2', target='Iris-setosa'), Sample(...)]

.. note::

    The ``transformer`` used in the ``FileListDatabase`` will not be fitted,
    and you should not perform any computationally heavy processing on the
    samples in it. You are expected to do only the minimal processing that
    makes the samples ready for experiments. Most of the time, you just load
    the data from disk in this transformer and return delayed samples.
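For instance, a lazy-loading transformer could look like the sketch below. It
assumes a hypothetical database whose ``.csv`` files have a ``path`` column
and uses a stand-in ``load_data`` function (neither exists in the iris
example); no data is read from disk until a sample's ``data`` attribute is
first accessed:

.. code-block:: python

    import functools

    import numpy as np
    from sklearn.preprocessing import FunctionTransformer

    import bob.pipelines as mario


    def load_data(path):
        # Stand-in loader; replace with however your data is stored on disk.
        return np.load(path)


    def lazy_prepare_samples(samples):
        # No I/O happens here: each DelayedSample only calls its ``load``
        # function when ``sample.data`` is accessed for the first time.
        return [
            mario.DelayedSample(functools.partial(load_data, sample.path), parent=sample)
            for sample in samples
        ]


    database = mario.FileListDatabase(
        "path/to/protocols",  # hypothetical dataset_protocols_path
        protocol="default",
        transformer=FunctionTransformer(lazy_prepare_samples),
    )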
Now our samples are ready to be used, and we can run a simple experiment with
them.

Running An Experiment
---------------------

Here, we want to train a Linear Discriminant Analysis (LDA) classifier on the
data. Before that, we want to normalize the range of our data and convert the
``target`` labels to integers.

.. doctest:: csv_iris_database

    >>> from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    >>> from sklearn.preprocessing import StandardScaler, LabelEncoder
    >>> from sklearn.pipeline import Pipeline

    >>> scaler = StandardScaler()
    >>> encoder = LabelEncoder()
    >>> lda = LinearDiscriminantAnalysis()
    >>> scaler = mario.wrap(["sample"], scaler)
    >>> encoder = mario.wrap(["sample"], encoder, input_attribute="target", output_attribute="y")
    >>> lda = mario.wrap(["sample"], lda, fit_extra_arguments=[("y", "y")])

    >>> pipeline = Pipeline([('scaler', scaler), ('encoder', encoder), ('lda', lda)])
    >>> pipeline.fit(database.samples(groups="train"))
    Pipeline(...)
    >>> encoder.estimator.classes_
    array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']...)
    >>> predictions = pipeline.predict(database.samples(groups="test"))
    >>> predictions[0].data, predictions[0].target, predictions[0].y
    (0, 'Iris-setosa', 0)
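Since each prediction carries both the predicted class (in ``data``) and the
encoded true label (in ``y``), the test accuracy can be computed directly from
the sample attributes. A minimal sketch, using only what the doctest above
already established:

.. code-block:: python

    import numpy as np

    # ``data`` holds the predicted class index and ``y`` the encoded true
    # label, so accuracy is the fraction of matching pairs.
    accuracy = np.mean([p.data == p.y for p in predictions])
    print(f"test accuracy: {accuracy:.2f}")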