Databases

A database is a collection of data files, one for each output of the database. These data are the inputs to the BEAT toolchains. It is therefore important to define evaluation protocols, which describe how a specific system must use the raw data of a given database.

For instance, a recognition system will typically use a subset of the data to train a recognition model, while another subset of data will be used to evaluate the performance of this model.

Structure of a database

A database has the following structure on disk:

database_name/
    output1_name.data
    output2_name.data
    ...
    outputN_name.data

For a given database, BEAT typically stores the root folder that contains this raw data, together with a description of the database.

Evaluation protocols

A BEAT evaluation protocol consists of several datasets, each dataset having several outputs with well-defined data formats. In practice, each dataset will typically be used for a different purpose.

For instance, in the case of a simple face recognition protocol, the database may be split into three datasets: one for training, one for enrolling client-specific models, and one for testing these models. The training dataset may have two outputs: grayscale images as two-dimensional arrays of type uint8 and client ids as uint64 integers.

BEAT is data-driven, which means that all the outputs of a given dataset are synchronized. The way the data is generated by each template is defined in a piece of code called the database view. It is important that a database view has a deterministic behavior for reproducibility purposes.

Like the other building blocks in BEAT, a database consists of two main components: a JSON declaration and source code (a database view written in Python). Each component is described below.

JSON declaration

Each database has a JSON declaration. This file describes the protocols, the datasets included in each protocol, the database view used by each dataset, and more. Here is an example of the JSON declaration file for the atnt database, which has only one protocol, named “idiap”. This protocol is used for a simple face recognition system and has three datasets: “train”, “templates”, and “probes”.

{
    "description": "The AT&T Database of Faces",
    "protocols": [
        {
            "name": "idiap",
            "sets": [
                {
                    "name": "train",
                    "outputs": {
                        "client_id": "system/uint64/1",
                        "file_id": "system/uint64/1",
                        "image": "system/array_2d_uint8/1"
                    },
                    "parameters": {},
                    "template": "train",
                    "view": "Train"
                },
                {
                    "name": "templates",
                    "outputs": {
                        "client_id": "system/uint64/1",
                        "file_id": "system/uint64/1",
                        "image": "system/array_2d_uint8/1",
                        "template_id": "system/uint64/1"
                    },
                    "parameters": {},
                    "template": "templates",
                    "view": "Templates"
                },
                {
                    "name": "probes",
                    "outputs": {
                        "client_id": "system/uint64/1",
                        "file_id": "system/uint64/1",
                        "image": "system/array_2d_uint8/1",
                        "probe_id": "system/uint64/1",
                        "template_ids": "system/array_1d_uint64/1"
                    },
                    "parameters": {},
                    "template": "probes",
                    "view": "Probes"
                }
            ],
            "template": "simple_face_recognition"
        }
    ],
    "root_folder": "/path_to_db_folder/att_faces",
    "environment": {
        "name": "Databases",
        "version": "1.0.0"
    }
}

The JSON file for a database has three main fields:

  • description: A short description of the database.

  • protocols: a list of protocols defined for the database.

  • root_folder: path to the directory where the data is stored.

and one optional field:

  • environment: The name and version of the runtime environment to use for this database.

If the environment field is not given, the system will try to find an appropriate environment using the database name as reference. If no database environment declares support for the database, then the execution will fail.

The content of name and version follows the same pattern as for execution environments, as can be seen on the BEAT environments page. The name is what comes before the parentheses, and the version is what is found between them.
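As a minimal sketch of the naming rule above, the following splits a label of the form "name (version)" into the two values expected by the environment field. The helper name split_environment_label and the sample label are hypothetical and are not part of BEAT.

```python
import re

# Hypothetical helper (not part of BEAT): split a label such as
# "Databases (1.0.0)" into the "name" and "version" values.
_LABEL = re.compile(r"^\s*(?P<name>.+?)\s*\((?P<version>[^()]+)\)\s*$")


def split_environment_label(label):
    """Return {'name': ..., 'version': ...} for a 'name (version)' label."""
    match = _LABEL.match(label)
    if match is None:
        raise ValueError("expected 'name (version)', got %r" % label)
    return {
        "name": match.group("name"),
        "version": match.group("version").strip(),
    }


environment = split_environment_label("Databases (1.0.0)")
print(environment)
```

The resulting dictionary has the same shape as the environment entry in the JSON declaration shown earlier.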

The “protocols” field is where the datasets for each protocol are defined. In the example above, only one protocol is defined. Implementing a new protocol means adding a new entry to the list of protocols. Each protocol has three main components:

  • name: The name of the protocol, which is “idiap” in this case.

  • sets: The datasets which are included in this protocol. In this case the “idiap” protocol consists of three datasets: “train”, “templates”, and “probes”.

  • template: A template describes the number of sets and the set template used for each set. Different protocols can use the same template, which means they can be used in any application that accepts such a structure. However, each set may use a different database view, which is ultimately what makes the protocols differ.

Each set in the list of “sets” in the above example is a dataset that is used for a particular purpose. For example, in the case of simple face recognition, the “train” dataset is used for training a model, “templates” is used for building a template for each identity, and “probes” is used to measure the performance of the system. Each set has the following components:

  • name: The name of the set.

  • outputs: The outputs provided by the set. Each output has a name and a specific data format which should be taken into consideration when using the data.

  • parameters: Extra parameters which can be given to the index() method of a database view and can be used to further specify the data fed to the system. For example, two datasets can use the same database view, but with a different parameter (e.g. “group”: “train”), so that only the data in that group is available in the outputs of the set.

  • template: The set template, which defines the number of outputs and their names.

  • view: The database view that is used to provide the data samples to the system. More information about the implementation of a database view is given in Database View.
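The nesting of protocols, sets and outputs described above can be explored with the standard json module. The following is a minimal sketch that uses a trimmed-down copy of the atnt declaration; the function name list_outputs is hypothetical, and nothing here relies on the BEAT libraries.

```python
import json

# Trimmed-down declaration with the same layout as the atnt example.
DECLARATION = """
{
    "description": "The AT&T Database of Faces",
    "protocols": [
        {
            "name": "idiap",
            "template": "simple_face_recognition",
            "sets": [
                {
                    "name": "train",
                    "template": "train",
                    "view": "Train",
                    "parameters": {},
                    "outputs": {
                        "client_id": "system/uint64/1",
                        "file_id": "system/uint64/1",
                        "image": "system/array_2d_uint8/1"
                    }
                }
            ]
        }
    ],
    "root_folder": "/path_to_db_folder/att_faces"
}
"""


def list_outputs(declaration):
    """Yield (protocol, set, output, data format) for every declared output."""
    for protocol in declaration["protocols"]:
        for dataset in protocol["sets"]:
            for output, dataformat in sorted(dataset["outputs"].items()):
                yield protocol["name"], dataset["name"], output, dataformat


declaration = json.loads(DECLARATION)
for row in list_outputs(declaration):
    print("/".join(row[:3]), "->", row[3])
```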

Database View

A database view is a piece of code that defines how the raw data should be fed to the system, based on the defined protocols. Each database view is a class that inherits from beat.backend.python.database.View and implements two methods: index() and get(). Each database block in an experiment is assigned to a database view.

The index() method is only used when the system is indexing the raw data. This means that the system makes a list of available raw data objects. Here is an example of an index() method:

def index(self, root_folder, parameters):

    """
    This function returns a list of (named) tuples describing the data provided by the view.
        The ordering of values inside the tuples is free, but it is expected
        that the list is ordered in a consistent manner (ie. all train images of
        person A, then all train images of person B, ...).

        For instance, assuming a view providing that kind of data:

        ----------- ----------- ----------- ----------- ----------- -----------
        |  image  | |  image  | |  image  | |  image  | |  image  | |  image  |
        ----------- ----------- ----------- ----------- ----------- -----------
        ----------- ----------- ----------- ----------- ----------- -----------
        | file_id | | file_id | | file_id | | file_id | | file_id | | file_id |
        ----------- ----------- ----------- ----------- ----------- -----------
        ----------------------------------- -----------------------------------
        |             client_id           | |             client_id           |
        ----------------------------------- -----------------------------------

        a list like the following should be generated:

        [
            (client_id=1, file_id=1, image=filename1),
            (client_id=1, file_id=2, image=filename2),
            (client_id=1, file_id=3, image=filename3),
            (client_id=2, file_id=4, image=filename4),
            (client_id=2, file_id=5, image=filename5),
            (client_id=2, file_id=6, image=filename6),
            ...
        ]



        DO NOT store images, sound files or data loadable from a file in the list!
        Store the path of the file to load instead.
        """
    Entry = namedtuple('Entry', ['client_id', 'file_id', 'image'])

    # Open the database and load the objects to provide via the outputs
    db = bob.db.atnt.Database()
    objs = sorted(db.objects(groups='world', purposes=None),
                  key=lambda x: (x.client_id, x.id))

    return [Entry(x.client_id, x.id, x.make_path(root_folder, '.pgm')) for x in objs]

The database views available in the BEAT platform use bob database packages that have well-defined protocols and datasets (e.g. train/dev/test). For more information, see database interfaces.

However, defining new database views is not limited to using such packages.
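As an illustration of an index built without any bob package, here is a minimal sketch that walks a directory tree directly. The folder layout (root/group/s<client_id>/<file_id>.pgm) and the “group” parameter are assumptions made for this example, and index() is written as a plain function rather than a method of a View subclass so that the snippet runs on its own.

```python
import os
import tempfile
from collections import namedtuple

Entry = namedtuple("Entry", ["client_id", "file_id", "image"])


def index(root_folder, parameters):
    """Index a hypothetical layout: <root>/<group>/s<client_id>/<file_id>.pgm"""
    group_folder = os.path.join(root_folder, parameters["group"])
    entries = []
    for client_dir in os.listdir(group_folder):
        client_id = int(client_dir.lstrip("s"))
        client_folder = os.path.join(group_folder, client_dir)
        for filename in os.listdir(client_folder):
            file_id = int(os.path.splitext(filename)[0])
            # Store the path only; the image itself is loaded later, in get()
            path = os.path.join(client_folder, filename)
            entries.append(Entry(client_id, file_id, path))
    # Sort numerically so the result is deterministic and grouped by client
    return sorted(entries, key=lambda e: (e.client_id, e.file_id))


# Build a tiny throw-away tree to exercise the function
root = tempfile.mkdtemp()
for client, files in (("s1", (1, 2)), ("s2", (3,))):
    os.makedirs(os.path.join(root, "train", client))
    for file_id in files:
        open(os.path.join(root, "train", client, "%d.pgm" % file_id), "wb").close()

entries = index(root, {"group": "train"})
print([(e.client_id, e.file_id) for e in entries])  # [(1, 1), (1, 2), (2, 3)]
```

Sorting the entries explicitly matters here because os.listdir() makes no promise about ordering, while a view is expected to behave deterministically.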

The get() method is called every time a block fetches raw data from the database. The data format of each database output is defined in this method. For example:

def get(self, output, index):

    """
    This function returns the data at the provided index for the output in the list
    of (named) tuples defined in the index() method. The full index is available as ``self.objs``.
    """

    obj = self.objs[index]

    if output == 'client_id':
        return {
            'value': np.uint64(obj.client_id)
        }

    elif output == 'file_id':
        return {
            'value': np.uint64(obj.file_id)
        }

    elif output == 'image':
        return {
            'value': bob.io.base.load(obj.image)
        }

If you want to know more about the underlying source code of these two methods, refer to the implementation of beat.backend.python.database.View.

The following is an example of a database view that is used by a subset of the atnt database:

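from collections import namedtuple

import bob.db.atnt
import bob.io.base
import numpy as np

from beat.backend.python.database import View


class Train(View):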
    """Outputs:
        - image: "system/array_2d_uint8/1"
        - file_id: "system/uint64/1"
        - client_id: "system/uint64/1"

    One "file_id" is associated with a given "image".
    Several "image" are associated with a given "client_id".

    --------------- --------------- --------------- --------------- --------------- ---------------
    |    image    | |    image    | |    image    | |    image    | |    image    | |    image    |
    --------------- --------------- --------------- --------------- --------------- ---------------
    --------------- --------------- --------------- --------------- --------------- ---------------
    |   file_id   | |   file_id   | |   file_id   | |   file_id   | |   file_id   | |   file_id   |
    --------------- --------------- --------------- --------------- --------------- ---------------
    ----------------------------------------------- -----------------------------------------------
    |                   client_id                 | |                   client_id                 |
    ----------------------------------------------- -----------------------------------------------
    """

    def index(self, root_folder, parameters):
        Entry = namedtuple('Entry', ['client_id', 'file_id', 'image'])

        # Open the database and load the objects to provide via the outputs
        db = bob.db.atnt.Database()
        objs = sorted(db.objects(groups='world', purposes=None),
                      key=lambda x: (x.client_id, x.id))

        return [Entry(x.client_id, x.id, x.make_path(root_folder, '.pgm')) for x in objs]

    def get(self, output, index):
        obj = self.objs[index]

        if output == 'client_id':
            return {
                'value': np.uint64(obj.client_id)
            }

        elif output == 'file_id':
            return {
                'value': np.uint64(obj.file_id)
            }

        elif output == 'image':
            return {
                'value': bob.io.base.load(obj.image)
            }

Note

Each view comes with documentation describing how its different outputs are synchronized.

In the example above, if there are 10000 images in the dataset, there will be 10000 entries in the list returned from the index() method. The BEAT platform uses this information to efficiently split the jobs across several machines during the experiment. The list is expected to be in a logical order (here, entries are grouped by client_id).
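This document does not describe how BEAT actually divides the index between machines. Purely as an illustration of why a flat, ordered list is convenient for that purpose, the sketch below cuts a list of a given length into contiguous index ranges; the function name split_ranges and the even-split strategy are assumptions made for this example.

```python
def split_ranges(n_entries, n_workers):
    """Return contiguous, inclusive (start, end) index ranges covering the list."""
    base, extra = divmod(n_entries, n_workers)
    ranges, start = [], 0
    for worker in range(n_workers):
        # The first `extra` workers take one additional entry each
        size = base + (1 if worker < extra else 0)
        if size == 0:
            continue  # more workers than entries
        ranges.append((start, start + size - 1))
        start += size
    return ranges


print(split_ranges(10000, 3))  # [(0, 3333), (3334, 6666), (6667, 9999)]
```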

For each entry in the dataset (represented as a named tuple), all the necessary data is provided by index(). For performance reasons, get() is expected not to instantiate bob.db.atnt.Database() again. The user can store any information in the index, except under names that Python named tuples reject, such as the keyword class. To use such a name, the user should map it to a valid field name in a dictionary defined before the index() method:

def __init__(self):
    super().__init__()
    self.output_member_map = {'class': 'cls'}
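The restriction comes from the standard library: collections.namedtuple raises ValueError when a field name is a Python keyword. The sketch below demonstrates this and shows one way a mapping like the one above could translate an output name into a tuple member. How BEAT consumes output_member_map internally is not shown in this document, so the member_for helper is hypothetical.

```python
from collections import namedtuple

# 'class' is a Python keyword, so it cannot be a named-tuple field
try:
    namedtuple("Entry", ["class", "image"])
    rejected = False
except ValueError:
    rejected = True

# Use a safe field name in the index; the mapping mirrors the
# output_member_map shown above (output name -> tuple member)
output_member_map = {"class": "cls"}
Entry = namedtuple("Entry", ["cls", "image"])
entry = Entry(cls=3, image="/path/to/file.pgm")


def member_for(output):
    """Translate an output name into the matching named-tuple field."""
    return output_member_map.get(output, output)


print(rejected)                                # True
print(getattr(entry, member_for("class")))    # 3
print(getattr(entry, member_for("image")))    # /path/to/file.pgm
```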

Some information from the database can be stored directly in the index (in the given example: client_id and file_id). For other information that requires opening a file, only the filename should be stored by index(), and the file should be processed later in get().
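The division of labour between the two methods can be sketched with a small stand-in class that mimics the index()/get() contract without depending on BEAT or bob. The class name LazyView, the load_index helper and the “data” output are all hypothetical; a real view would inherit from beat.backend.python.database.View instead.

```python
import os
import tempfile
from collections import namedtuple

# As in the atnt example, the "data" field holds a path, not the content
Entry = namedtuple("Entry", ["file_id", "data"])


class LazyView:
    """Stand-in mimicking the index()/get() contract; not a BEAT View."""

    def index(self, root_folder, parameters):
        # Cheap metadata only: an id and the path of the file to load later
        names = sorted(os.listdir(root_folder))
        return [Entry(i, os.path.join(root_folder, n)) for i, n in enumerate(names)]

    def load_index(self, root_folder, parameters):
        # Hypothetical helper: keep the index available as self.objs
        self.objs = self.index(root_folder, parameters)

    def get(self, output, index):
        obj = self.objs[index]
        if output == "file_id":
            return {"value": obj.file_id}
        elif output == "data":
            # The file is opened only when a block asks for this output
            with open(obj.data, "rb") as handle:
                return {"value": handle.read()}
        raise KeyError(output)


root = tempfile.mkdtemp()
with open(os.path.join(root, "a.bin"), "wb") as handle:
    handle.write(b"raw bytes")

view = LazyView()
view.load_index(root, {})
print(view.get("file_id", 0), view.get("data", 0))
```

Keeping file contents out of the index keeps the index small, and the cost of reading a file is paid only for the entries a block actually requests.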

Once the database view is written, the user must index the database with the command-line tool:

$ beat database index mydatabase/1

The user can index the content of a protocol:

$ beat database index mydatabase/1/protocolname

Or the content of a set in each protocol:

$ beat database index mydatabase/1/protocolname/setname

Database set templates

In practice, different databases used for the same purpose may have the exact same datasets with the exact same outputs (and attached data formats). In this case, it is useful to abstract the definition of the database sets away from any particular database. BEAT defines database set templates for this purpose.

For instance, the simple face recognition evaluation protocol described above, which consists of three datasets and a few outputs, may be abstracted in a database set template. This template defines the datasets, their outputs, and the corresponding data formats. If several databases implement such a protocol, they may rely on the same database set template. Similarly, evaluation protocols testing different conditions (such as enrolling on clean and testing on clean data vs. enrolling on clean and testing on noisy data) may rely on the same database set template.

In practice, this reduces the amount of work needed to integrate new databases and/or new evaluation protocols into the platform. Besides, at the experiment level, it allows a toolchain to be re-used on a different database with almost no configuration changes from the user.
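One way to picture what a shared template guarantees is a check that two protocol declarations expose the same sets, outputs and data formats, even when their views differ. The following is a minimal sketch under that reading; the function names and the two sample protocols are hypothetical, and BEAT's own validation is not shown in this document.

```python
def set_signature(dataset):
    """Reduce a set declaration to what a template fixes: outputs and formats."""
    return tuple(sorted(dataset["outputs"].items()))


def protocol_signature(protocol):
    """Map each set-template name to the signature of the set using it."""
    return {d["template"]: set_signature(d) for d in protocol["sets"]}


def share_template(protocol_a, protocol_b):
    """True when both protocols expose the same sets, outputs and data formats."""
    return (protocol_a["template"] == protocol_b["template"]
            and protocol_signature(protocol_a) == protocol_signature(protocol_b))


def make_protocol(name, view):
    # Hypothetical protocol with a single set; only the name and view vary
    return {
        "name": name,
        "template": "simple_face_recognition",
        "sets": [{
            "name": "train",
            "template": "train",
            "view": view,
            "parameters": {},
            "outputs": {"client_id": "system/uint64/1",
                        "image": "system/array_2d_uint8/1"},
        }],
    }


clean = make_protocol("clean", "TrainClean")
noisy = make_protocol("noisy", "TrainNoisy")
print(share_template(clean, noisy))  # True
```

Two protocols that pass such a check can feed the same toolchain, because every block sees the same output names and data formats regardless of which view produced the data.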

Databases and Docker

The BEAT docker images can be found on the beat Docker Hub page.

To get one of the images on your machine, use the usual Docker command:

$ docker pull beatenv/beat.env.XXX:TAG

Note that there is no “latest” version, because of the reproducibility goal of BEAT. You will therefore always have to pull the precise version you want to use.

In order to list all the BEAT related Docker images on your machine, use the following command:

$ beat environments list --type docker

To have a more curated list of images, the following command shows only the database related BEAT images:

$ docker images --format "{{.Repository}}:{{.Tag}}" --filter label="beat.env.type=database"