File List Databases (CSV)

We saw in Samples how enhancing scikit-learn pipelines with sample metadata can improve the workflow of our machine learning experiments. However, we did not discuss how to create the samples in the first place.

In reproducible machine learning experiments, each database comes with one or several protocols that define exactly which files should be used for training, development, and testing. These protocols can be defined in .csv files where each row represents a sample. Defining the protocols of a database in .csv files is advantageous because such files are easy to create and read, and they can be imported and used in many different libraries.

Here, we provide bob.pipelines.FileListDatabase, which reads .csv files and generates bob.pipelines.Sample objects. The format is extremely simple: put all the protocol files in a folder with the following structure:

dataset_protocols_path/<protocol>/<group>.csv

where each subfolder holds one specific protocol and each file contains the samples of a specific group or set (e.g. the training set). The name of each protocol is the name of its folder, and the name of each group is the name of the corresponding file.

Note

Instead of pointing to a folder, you can also point to a compressed tarball that contains the protocol files.
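For instance, here is a minimal sketch of packing a protocol folder into such a tarball with Python's standard tarfile module (the folder name iris_database anticipates the example further below):

>>> import tarfile
>>> # compress the whole protocol folder; the resulting archive can then be
>>> # passed to FileListDatabase in place of the folder path
>>> with tarfile.open("iris_database.tar.gz", "w:gz") as tar:
...     tar.add("iris_database", arcname=".")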

The .csv files must have the following structure:

attribute_1,attribute_2,...,attribute_n
sample_1_attribute_1,sample_1_attribute_2,...,sample_1_attribute_n
sample_2_attribute_1,sample_2_attribute_2,...,sample_2_attribute_n
...
sample_n_attribute_1,sample_n_attribute_2,...,sample_n_attribute_n

Each row contains exactly one sample (e.g. one image) and each column represents one attribute of the samples (e.g. the path to the data or other metadata).
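As a sketch, such protocol files can be generated with Python's built-in csv module (the dataset name, protocol name, and attributes below are purely illustrative):

>>> import csv
>>> import os
>>> os.makedirs("my_dataset/my_protocol", exist_ok=True)
>>> rows = [
...     ["path", "label"],              # header: the attribute names
...     ["images/img_001.png", "cat"],  # one row per sample
...     ["images/img_002.png", "dog"],
... ]
>>> with open("my_dataset/my_protocol/train.csv", "w", newline="") as f:
...     csv.writer(f).writerows(rows)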

An Example

Below is an example of creating the iris database. The .csv files distributed with this package have the following structure:

iris_database/
    default/
        train.csv
        test.csv

As you can see, there is only one protocol, called default, and two groups, train and test. Moreover, the .csv files have the following format:

sepal_length,sepal_width,petal_length,petal_width,target
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3,1.4,0.2,Iris-setosa
...
>>> import pkg_resources
>>> import bob.pipelines as mario
>>> dataset_protocols_path = pkg_resources.resource_filename(
...     'bob.pipelines', 'tests/data/iris_database')
>>> database = mario.FileListDatabase(
...     dataset_protocols_path,
...     protocol="default",
... )
>>> database.samples(groups="train")
[Sample(data=None, sepal_length='5.1', sepal_width='3.5', petal_length='1.4', petal_width='0.2', target='Iris-setosa'), Sample(...)]
>>> database.samples(groups="test")
[Sample(data=None, sepal_length='5', sepal_width='3', petal_length='1.6', petal_width='0.2', target='Iris-setosa'), Sample(...)]
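The database also knows which protocols and groups it found on disk. The method names and the exact output below are assumptions based on the bob.pipelines API and the folder layout shown above:

>>> database.protocols()
['default']
>>> database.groups()
['test', 'train']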

As you can see, all attributes are loaded as strings. Moreover, we may want to transform our samples before using them.

Transforming Samples

bob.pipelines.FileListDatabase accepts a transformer that will be applied to all samples:

>>> import numpy as np
>>> from sklearn.preprocessing import FunctionTransformer

>>> def prepare_data(sample):
...     return np.array(
...         [sample.sepal_length, sample.sepal_width,
...          sample.petal_length, sample.petal_width],
...         dtype=float
...     )

>>> def prepare_iris_samples(samples):
...     return [mario.Sample(prepare_data(sample), parent=sample) for sample in samples]

>>> database = mario.FileListDatabase(
...     dataset_protocols_path,
...     protocol="default",
...     transformer=FunctionTransformer(prepare_iris_samples),
... )
>>> database.samples(groups="train")
[Sample(data=array([5.1, 3.5, 1.4, 0.2]), sepal_length='5.1', sepal_width='3.5', petal_length='1.4', petal_width='0.2', target='Iris-setosa'), Sample(...)]

Note

The transformer used in the FileListDatabase will not be fitted, and you should not perform any computationally heavy processing on the samples in it. You are expected to do only the minimal processing that makes the samples ready for experiments. Most of the time, you just load the data from disk in this transformer and return delayed samples.
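For example, here is a sketch of such a transformer that defers disk I/O with bob.pipelines.DelayedSample; the load_from_disk helper and the path attribute are hypothetical and stand in for whatever your .csv files reference:

>>> import functools
>>> def load_from_disk(path):
...     # hypothetical loader; replace with your own reading code
...     return np.loadtxt(path)
>>> def delay_samples(samples):
...     # the data is only read when sample.data is first accessed
...     return [
...         mario.DelayedSample(functools.partial(load_from_disk, s.path), parent=s)
...         for s in samples
...     ]
>>> delayed_transformer = FunctionTransformer(delay_samples)

Such a delayed_transformer would be passed to FileListDatabase through the same transformer argument as above.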

Now our samples are ready to be used and we can run a simple experiment with them.

Running An Experiment

Here, we want to train a Linear Discriminant Analysis (LDA) classifier on the data. Before that, we want to normalize the range of our data and convert the target labels to integers.

>>> from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
>>> from sklearn.preprocessing import StandardScaler, LabelEncoder
>>> from sklearn.pipeline import Pipeline
>>> scaler = StandardScaler()
>>> encoder = LabelEncoder()
>>> lda = LinearDiscriminantAnalysis()

>>> scaler = mario.wrap(["sample"], scaler)
>>> encoder = mario.wrap(["sample"], encoder, input_attribute="target", output_attribute="y")
>>> lda = mario.wrap(["sample"], lda, fit_extra_arguments=[("y", "y")])

>>> pipeline = Pipeline([('scaler', scaler), ('encoder', encoder), ('lda', lda)])
>>> pipeline.fit(database.samples(groups="train"))
Pipeline(...)
>>> encoder.estimator.classes_
array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']...)
>>> predictions = pipeline.predict(database.samples(groups="test"))
>>> predictions[0].data, predictions[0].target, predictions[0].y
(0, 'Iris-setosa', 0)
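Since each predicted sample carries both the prediction (in data) and the encoded ground-truth label (in y), a simple accuracy check can be sketched on top of this result (this evaluation step is not part of the original example):

>>> # fraction of test samples whose predicted class matches the encoded label
>>> accuracy = np.mean([sample.data == sample.y for sample in predictions])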