Python Backend for BEAT

The BEAT platform is a web-based system for certifying results of software-based, data-driven workflows that can be functionally subdivided into processing blocks. The platform takes the burden of hosting data and software away from users by providing a capable computing farm that handles both aspects gracefully. Data is kept sequestered inside the platform. The user provides the description of data formats, algorithms, data flows (also known as toolchains) and experimental details (parameters), which are combined inside the platform to produce results that are easily exportable as graphics or tables for scientific reports.

This package defines a backend to execute algorithms written in the Python programming language.

Creating new database views in BEAT

To implement a view, one needs to write a class that inherits from beat.backend.python.database.View, and implement two methods: index() and get().

Here is the documentation of those methods:

The index() function:

def index(self, root_folder, parameters):
    """Returns a list of (named) tuples describing the data provided by the view.

    The ordering of values inside the tuples is free, but it is expected
    that the list is ordered in a consistent manner (i.e. all train images of
    person A, then all train images of person B, ...).

    For instance, assuming a view providing that kind of data:

    ----------- ----------- ----------- ----------- ----------- -----------
    |  image  | |  image  | |  image  | |  image  | |  image  | |  image  |
    ----------- ----------- ----------- ----------- ----------- -----------
    ----------- ----------- ----------- ----------- ----------- -----------
    | file_id | | file_id | | file_id | | file_id | | file_id | | file_id |
    ----------- ----------- ----------- ----------- ----------- -----------
    ----------------------------------- -----------------------------------
    |             client_id           | |             client_id           |
    ----------------------------------- -----------------------------------

    a list like the following should be generated:

    [
        (client_id=1, file_id=1, image=filename1),
        (client_id=1, file_id=2, image=filename2),
        (client_id=1, file_id=3, image=filename3),
        (client_id=2, file_id=4, image=filename4),
        (client_id=2, file_id=5, image=filename5),
        (client_id=2, file_id=6, image=filename6),
        ...
    ]

    DO NOT store images, sound files or data loadable from a file in the list!
    Store the path of the file to load instead.
    """
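
The ordering expectation stated in the docstring above can be checked before an index is returned. The following is a minimal sketch, not part of the BEAT API: the helper name is_grouped and the sample entries are hypothetical, and "consistent ordering" is assumed here to mean that each client_id occupies a single contiguous run of the list.

```python
from collections import namedtuple
from itertools import groupby

# Hypothetical entry type mirroring the example in the docstring above.
Entry = namedtuple("Entry", ["client_id", "file_id", "image"])


def is_grouped(entries, field):
    """Return True if every value of `field` occupies one contiguous run."""
    seen = set()
    for value, _run in groupby(entries, key=lambda e: getattr(e, field)):
        if value in seen:
            return False  # the same value reappears after a different one
        seen.add(value)
    return True


good = [
    Entry(1, 1, "filename1"),
    Entry(1, 2, "filename2"),
    Entry(2, 3, "filename3"),
]
bad = [good[0], good[2], good[1]]  # client 1, client 2, then client 1 again

grouped_ok = is_grouped(good, "client_id")
grouped_bad = is_grouped(bad, "client_id")
```

Sorting the entries on a key that starts with the grouping field, as the atnt example in this document does with (client_id, id), is one simple way to satisfy such a check.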

The get() function:

def get(self, output, index):
    """Returns the data of the provided output at the provided index in the list
    of (named) tuples describing the data provided by the view (accessible at
    self.objs)"""

Taking the atnt/5 database as an example, the view named “Train” is implemented as follows (note that each view comes with documentation describing how the different outputs are synchronised):

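from collections import namedtuple

import numpy as np
import bob.db.atnt
import bob.io.base

from beat.backend.python.database import View


class Train(View):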
    """Outputs:
        - image: "{{ system_user.username }}/array_2d_uint8/1"
        - file_id: "{{ system_user.username }}/uint64/1"
        - client_id: "{{ system_user.username }}/uint64/1"

    One "file_id" is associated with a given "image".
    Several "image" are associated with a given "client_id".

    --------------- --------------- --------------- --------------- --------------- ---------------
    |    image    | |    image    | |    image    | |    image    | |    image    | |    image    |
    --------------- --------------- --------------- --------------- --------------- ---------------
    --------------- --------------- --------------- --------------- --------------- ---------------
    |   file_id   | |   file_id   | |   file_id   | |   file_id   | |   file_id   | |   file_id   |
    --------------- --------------- --------------- --------------- --------------- ---------------
    ----------------------------------------------- -----------------------------------------------
    |                   client_id                 | |                   client_id                 |
    ----------------------------------------------- -----------------------------------------------
    """

    def index(self, root_folder, parameters):
        Entry = namedtuple('Entry', ['client_id', 'file_id', 'image'])

        # Open the database and load the objects to provide via the outputs
        db = bob.db.atnt.Database()
        objs = sorted(db.objects(groups='world', purposes=None),
                      key=lambda x: (x.client_id, x.id))

        return [ Entry(x.client_id, x.id, x.make_path(root_folder, '.pgm')) for x in objs ]


    def get(self, output, index):
        obj = self.objs[index]

        if output == 'client_id':
            return {
                'value': np.uint64(obj.client_id)
            }

        elif output == 'file_id':
            return {
                'value': np.uint64(obj.file_id)
            }

        elif output == 'image':
            return {
                'value': bob.io.base.load(obj.image)
            }

Note that:

1) This view exactly matches the example from the documentation of the View class. In particular, index() returns a list looking like:

[
    (client_id=1, file_id=1, image=filename1),
    (client_id=1, file_id=2, image=filename2),
    (client_id=1, file_id=3, image=filename3),
    (client_id=2, file_id=4, image=filename4),
    (client_id=2, file_id=5, image=filename5),
    (client_id=2, file_id=6, image=filename6),
    ...
    (client_id=100, file_id=10000, image=filename10000),
]

If there are 10000 images in the dataset, there will be 10000 entries in that list. The platform will use this information to efficiently split the jobs across several machines during the experiment. The list is expected to be in a logical order (here: entries are grouped by client_id).

2) For each entry in the dataset (represented as a named tuple), all the necessary data is provided by index(). For performance reasons, get() is expected not to instantiate bob.db.atnt.Database() again.

3) You’re free to put any info in the index, with the names you want for the fields (here, for simplicity, we have one field in the tuple per output of the view, with the same name). The platform doesn’t care.

4) Some data from the database can be stored directly in the index (here: client_id and file_id). For data that requires opening a file, put the filename in the index and load the file later in get().

5) The implementation of get() is straightforward: the full index is available as “self.objs”, so just return the data corresponding to the provided output at the given index.
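
The division of labour described in the notes above can be tried out without bob or the BEAT backend installed. The sketch below rests on stated assumptions: StandInView and its prepare() method are hypothetical stand-ins for the real View base class, the Samples view and its text files are invented for illustration, and plain Python values are returned where a real view would return typed values such as np.uint64.

```python
import os
import tempfile
from collections import namedtuple


class StandInView:
    """Hypothetical stand-in for beat.backend.python.database.View."""

    def prepare(self, root_folder, parameters):
        # Assumption: the index is built once and then exposed as self.objs.
        self.objs = self.index(root_folder, parameters)


class Samples(StandInView):
    def index(self, root_folder, parameters):
        Entry = namedtuple("Entry", ["client_id", "file_id", "text"])
        names = sorted(os.listdir(root_folder))
        # Cheap metadata goes into the index; file contents stay on disk.
        return [
            Entry(int(name.split("_")[0]), file_id, os.path.join(root_folder, name))
            for file_id, name in enumerate(names, start=1)
        ]

    def get(self, output, index):
        obj = self.objs[index]
        if output == "text":
            with open(obj.text) as handle:  # the file is opened only here
                return {"value": handle.read()}
        # Values already stored in the index are returned directly.
        return {"value": getattr(obj, output)}


with tempfile.TemporaryDirectory() as root:
    # File names encode a made-up client id as their prefix.
    for name, content in [("1_a.txt", "first"), ("1_b.txt", "second"), ("2_a.txt", "third")]:
        with open(os.path.join(root, name), "w") as handle:
            handle.write(content)

    view = Samples()
    view.prepare(root, parameters={})
    client_of_last = view.get("client_id", 2)
    text_of_second = view.get("text", 1)
```

The index holds only identifiers and paths, so building it stays cheap even for large datasets, while the cost of reading a file is paid only when get() is called for that entry.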

As for the JSON file describing the database, the format hasn’t changed. For an example of the usage of the parameters defined in the JSON file and given to index(), you can look at mnist/4.
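
As an illustration of how index() might consume its parameters argument, here is a minimal sketch. The parameter name "group", the RECORDS list and the paths are all hypothetical and are not taken from mnist/4 or from any real database description; the function is written without a class so that it runs on its own.

```python
from collections import namedtuple

Entry = namedtuple("Entry", ["label", "image"])

# Hypothetical records standing in for what a database query would return:
# (group, label, relative path).
RECORDS = [
    ("train", 0, "train/0001.png"),
    ("train", 1, "train/0002.png"),
    ("test", 0, "test/0001.png"),
]


def index(root_folder, parameters):
    """Build the index for the group named by the hypothetical 'group' parameter."""
    group = parameters["group"]
    return [
        Entry(label, root_folder + "/" + path)
        for record_group, label, path in sorted(RECORDS)
        if record_group == group
    ]


train_index = index("/data", {"group": "train"})
test_index = index("/data", {"group": "test"})
```

With this pattern a single view class can serve several protocol groups, with the selection made by the values declared in the JSON file.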

Once the view is written, you must index the database with the command-line tool, something like this:

./bin/beat --prefix=… db index mydatabase/1/myview
