bob.ip.common.data.dataset

Classes

CSVDataset(subsets, fieldnames, loader)

Generic multi-subset filelist dataset that yields samples

JSONDataset(protocols, fieldnames, loader)

Generic multi-protocol/subset filelist dataset that yields samples

class bob.ip.common.data.dataset.JSONDataset(protocols, fieldnames, loader)[source]

Bases: object

Generic multi-protocol/subset filelist dataset that yields samples

To create a new dataset, you need to provide one or more JSON formatted filelists (one per protocol) with the following contents:

{
    "subset1": [
        [
            "value1",
            "value2",
            "value3"
        ],
        [
            "value4",
            "value5",
            "value6"
        ]
    ],
    "subset2": [
    ]
}

Your dataset may contain any number of subsets, but all sample entries must contain the same number of fields.
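A filelist with this layout can be produced and sanity-checked with Python's standard json module. The sketch below is illustrative only: the file name default.json, the subset names and the field values are made up, and check_field_count is a hypothetical helper that is not part of this package.

```python
import json
import os
import tempfile

# Hypothetical protocol: two subsets, every entry has exactly two fields.
protocol = {
    "train": [["img/01.png", "mask/01.png"], ["img/02.png", "mask/02.png"]],
    "test": [["img/03.png", "mask/03.png"]],
}


def check_field_count(filelist):
    """Return the common number of fields, or raise if entries disagree."""
    counts = {len(entry) for entries in filelist.values() for entry in entries}
    if len(counts) > 1:
        raise ValueError(f"entries have differing field counts: {sorted(counts)}")
    return counts.pop() if counts else 0


# Write the filelist to a temporary directory and read it back.
path = os.path.join(tempfile.mkdtemp(), "default.json")
with open(path, "w") as f:
    json.dump(protocol, f, indent=4)

with open(path) as f:
    reloaded = json.load(f)

n_fields = check_field_count(reloaded)
```

Running such a check before handing the file to the dataset catches ragged entries early, which is the one structural constraint stated above.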

Parameters
  • protocols (list, dict) – Paths to one or more JSON formatted files containing the various protocols to be recognized by this dataset, or a dictionary mapping protocol names to paths (or opened file objects) of JSON files. Internally, we save a dictionary where keys default to the basename of paths (list input).

  • fieldnames (list, tuple) – An iterable over the field names (strings) to assign to each entry in the JSON file. It should have as many items as fields in each entry of the JSON file.

  • loader (object) –

    A function that receives as input a context dictionary (with at least “protocol” and “subset” keys indicating which protocol and subset are being served) and a dictionary with {fieldname: value} entries, and returns an object with at least 2 attributes:

    • key: which must be a unique string for every sample across subsets in a protocol, and

    • data: which contains the data associated with this sample
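As a rough illustration of the loader contract described above, the sketch below defines a function with the expected two arguments that returns an object exposing key and data. Everything in it is an assumption made for the example: the Sample container, the example_loader name and the "image" and "label" field names are hypothetical, and a real loader would typically read files from disk rather than pass the field values through.

```python
from collections import namedtuple

# Minimal container exposing the two attributes the dataset expects.
Sample = namedtuple("Sample", ["key", "data"])


def example_loader(context, sample):
    """Hypothetical loader: pick a key and return the raw fields as data.

    context: dict with at least "protocol" and "subset" entries
    sample: dict of {fieldname: value} for one filelist entry
    """
    # Assumption: the filelist has a field named "image" whose value is
    # unique within a protocol, so it can serve as the sample key.
    key = sample["image"]
    # Keep the field values and record where the sample came from.
    data = dict(sample, protocol=context["protocol"], subset=context["subset"])
    return Sample(key=key, data=data)


s = example_loader(
    {"protocol": "default", "subset": "train"},
    {"image": "img/01.png", "label": "mask/01.png"},
)
```

A named tuple is only one option here; any object with key and data attributes would satisfy the description.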

check(limit=0)[source]

For each protocol, check if all data can be correctly accessed

This function assumes each sample has a data and a key attribute. The key attribute should be a string, or representable as such.

Parameters

limit (int) – Maximum number of samples to check (in each protocol/subset combination) in this dataset. If set to zero, then check everything.

Returns

errors – Number of errors found

Return type

int
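The behaviour described for check can be pictured with the following stand-alone sketch. It is not the library's implementation: count_access_errors, Good and Broken are hypothetical names, and the sketch only assumes what is stated above, namely that each sample has key and data attributes and that a limit of zero means every sample is checked.

```python
def count_access_errors(subsets, limit=0):
    """Illustrative stand-in for the check described above.

    subsets: dict mapping subset names to lists of sample objects
    limit: check at most this many samples per subset (0 means all)
    """
    errors = 0
    for name, samples in subsets.items():
        to_check = samples if limit == 0 else samples[:limit]
        for sample in to_check:
            try:
                str(sample.key)  # the key must be representable as a string
                sample.data  # touching .data triggers any deferred loading
            except Exception:
                errors += 1
    return errors


class Good:
    key = "a"
    data = b"pixels"


class Broken:
    key = "b"

    @property
    def data(self):
        # Simulates a sample whose underlying file cannot be read.
        raise OSError("file missing")


demo = {"train": [Good(), Broken(), Broken()], "test": [Good()]}
all_errors = count_access_errors(demo)
limited_errors = count_access_errors(demo, limit=2)
```

With the limit set to 2, only the first two samples of each subset are touched, so the second broken sample in "train" goes unnoticed.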

subsets(protocol)[source]

Returns all subsets in a protocol

This method will load JSON information for a given protocol and return all subsets of the given protocol after converting each entry through the loader function.

Parameters

protocol (str) – Name of the protocol data to load

Returns

subsets – A dictionary mapping subset names to lists of objects (respecting the key, data interface).

Return type

dict
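To make the conversion step concrete, the sketch below re-creates it outside the library: each JSON entry is paired with the field names and passed through a loader together with its context. The names load_subsets and simple_loader, the "default" protocol and the "image" and "label" fields are all hypothetical, and the actual implementation may differ.

```python
import json
from types import SimpleNamespace


def load_subsets(json_text, protocol, fieldnames, loader):
    """Hypothetical re-creation of the described conversion (not library code)."""
    raw = json.loads(json_text)
    result = {}
    for subset, entries in raw.items():
        context = {"protocol": protocol, "subset": subset}
        # Pair each value with its field name, then hand it to the loader.
        result[subset] = [
            loader(context, dict(zip(fieldnames, entry))) for entry in entries
        ]
    return result


def simple_loader(context, sample):
    # Assumed: the "image" field is unique and can act as the key.
    return SimpleNamespace(key=sample["image"], data=sample)


text = '{"train": [["a.png", "a_mask.png"], ["b.png", "b_mask.png"]], "test": []}'
subsets = load_subsets(text, "default", ("image", "label"), simple_loader)
```

The result maps each subset name to a list of objects with key and data attributes, and an empty subset simply yields an empty list.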

class bob.ip.common.data.dataset.CSVDataset(subsets, fieldnames, loader)[source]

Bases: object

Generic multi-subset filelist dataset that yields samples

To create a new dataset, you only need to provide a CSV formatted filelist using any separator (e.g. comma, space, semi-colon) with the following information:

value1,value2,value3
value4,value5,value6
...

Notice that all rows must have the same number of entries.
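The row-length constraint can be verified with the standard csv module before building a dataset. This is a minimal sketch under stated assumptions: read_rows is a hypothetical helper, and the separator is passed explicitly because the text above does not say how the dataset itself detects it.

```python
import csv
import io


def read_rows(text, delimiter=","):
    """Parse CSV text and verify that every row has the same number of entries."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    rows = [row for row in reader if row]  # skip blank lines
    lengths = {len(row) for row in rows}
    if len(lengths) > 1:
        raise ValueError(f"rows have differing lengths: {sorted(lengths)}")
    return rows


comma_rows = read_rows("value1,value2,value3\nvalue4,value5,value6\n")
semicolon_rows = read_rows(
    "value1;value2;value3\nvalue4;value5;value6\n", delimiter=";"
)
```

Both calls produce the same list of rows, which shows that the separator affects only parsing and not the resulting fields.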

Parameters
  • subsets (list, dict) – Paths to one or more CSV formatted files containing the various subsets to be recognized by this dataset, or a dictionary mapping subset names to paths (or opened file objects) of CSV files. Internally, we save a dictionary where keys default to the basename of paths (list input).

  • fieldnames (list, tuple) – An iterable over the field names (strings) to assign to each column in the CSV file. It should have as many items as fields in each row of the CSV file(s).

  • loader (object) – A function that receives as input a context dictionary (with at least a “subset” key indicating which subset is being served) and a dictionary with {key: path} entries, and returns a dictionary with the loaded data.

check(limit=0)[source]

For each subset, check if all data can be correctly accessed

This function assumes each sample has a data and a key attribute. The key attribute should be a string, or representable as such.

Parameters

limit (int) – Maximum number of samples to check (in each subset) in this dataset. If set to zero, then check everything.

Returns

errors – Number of errors found

Return type

int

subsets()[source]

Returns all available subsets at once

Returns

subsets – A dictionary mapping subset names to lists of objects (respecting the key, data interface).

Return type

dict

samples(subset)[source]

Returns all samples in a subset

This method will load CSV information for a given subset and return all samples of the given subset after passing each entry through the loading function.

Parameters

subset (str) – Name of the subset data to load

Returns

subset – A list of objects (respecting the key, data interface).

Return type

list