Inputs/Outputs

Introduction

The requirements for the platform when reading/writing data are:

  • Ability to manage large and complex data
  • Portability to allow the use of heterogeneous environments

Based on our experience and on these requirements, we investigated the use of HDF5. Unfortunately, HDF5 is not convenient for handling structures such as arrays of variable-size elements, for instance arrays of strings. Therefore, we decided to rely on our own binary format.

Binary Format

Our binary format does not contain any information about the format of the data itself, so this format must be known a priori. In other words, the format cannot be inferred from the content of a file.

We rely on the following fundamental C-style formats:

  • int8
  • int16
  • int32
  • int64
  • uint8
  • uint16
  • uint32
  • uint64
  • float32
  • float64
  • complex64 (first real value, and then imaginary value)
  • complex128 (first real value, and then imaginary value)
  • bool (written as a byte)
  • string

An element of such a basic format is written in the C-style way, using little-endian byte ordering.
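
As an illustration, the fundamental formats can be mapped to little-endian codes of Python's struct module. This is only a minimal sketch: the names SCALAR_CODES, pack_scalar and unpack_scalar are hypothetical, and strings are left out because their encoding is not specified here.

```python
import struct

# Little-endian struct codes for the fundamental formats.
# The "<" prefix selects little-endian order, standard sizes, no padding.
SCALAR_CODES = {
    "int8": "<b", "int16": "<h", "int32": "<i", "int64": "<q",
    "uint8": "<B", "uint16": "<H", "uint32": "<I", "uint64": "<Q",
    "float32": "<f", "float64": "<d",
    "bool": "<?",  # written as a single byte
}

# Complex values are stored as two floats: real part first, then imaginary.
COMPLEX_CODES = {"complex64": "<2f", "complex128": "<2d"}


def pack_scalar(fmt, value):
    """Serialize one value of the fundamental format named `fmt`."""
    if fmt in COMPLEX_CODES:
        return struct.pack(COMPLEX_CODES[fmt], value.real, value.imag)
    return struct.pack(SCALAR_CODES[fmt], value)


def unpack_scalar(fmt, data):
    """Inverse of pack_scalar; `fmt` must be known a priori."""
    if fmt in COMPLEX_CODES:
        real, imag = struct.unpack(COMPLEX_CODES[fmt], data)
        return complex(real, imag)
    return struct.unpack(SCALAR_CODES[fmt], data)[0]
```

For instance, pack_scalar("int32", 1) yields the four bytes 01 00 00 00, with the least significant byte first.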

In addition, data formats always consist of arrays or dictionaries of such fundamental formats or of compound formats.

An array of elements is saved as follows. First, the shape of the array is saved using one uint64 value per dimension. Next, the elements of the array are saved in C-style (row-major) order.
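
The array layout can be sketched with the struct module as follows. The function names are hypothetical, elements are passed as a flat list already in C order, and the number of dimensions is assumed to be known to the reader a priori, since only the extent of each dimension is described as being stored.

```python
import struct
from math import prod


def pack_array(shape, flat_values, code):
    """Serialize an array.

    shape: tuple of ints, one per dimension.
    flat_values: elements listed in C (row-major) order.
    code: struct character of the element format, e.g. "h" for int16.
    """
    if len(flat_values) != prod(shape):
        raise ValueError("shape does not match the number of elements")
    # One little-endian uint64 per dimension, then the elements.
    header = struct.pack(f"<{len(shape)}Q", *shape)
    body = struct.pack(f"<{len(flat_values)}{code}", *flat_values)
    return header + body


def unpack_array(data, ndim, code):
    """Deserialize an array; ndim is not stored and must be known a priori."""
    shape = struct.unpack_from(f"<{ndim}Q", data, 0)
    count = prod(shape)
    values = struct.unpack_from(f"<{count}{code}", data, 8 * ndim)
    return shape, list(values)
```

With this sketch, a 2 x 3 array of int16 values occupies 16 bytes for the shape and 12 bytes for the elements.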

A dictionary of elements is saved as follows. First, the keys are sorted in lexicographic order. Then, the values associated with these keys are saved in that order.
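
A minimal sketch of the dictionary layout is given below. It assumes that the keys themselves are not written to the file (the format being known a priori) and that every value is a fundamental scalar; pack_dict, unpack_dict and the codes mapping are hypothetical names used for illustration only.

```python
import struct


def pack_dict(values, codes):
    """Serialize a dictionary of scalars.

    values: key -> scalar value.
    codes: key -> struct character describing the format of that value.
    """
    # sorted() orders string keys lexicographically (by code point).
    return b"".join(
        struct.pack("<" + codes[key], values[key]) for key in sorted(values)
    )


def unpack_dict(data, codes):
    """Deserialize using the same key ordering; codes must be known a priori."""
    out, offset = {}, 0
    for key in sorted(codes):
        fmt = "<" + codes[key]
        (out[key],) = struct.unpack_from(fmt, data, offset)
        offset += struct.calcsize(fmt)
    return out
```

Because both sides sort the keys in the same way, the reader can recover each value from its position alone.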

The platform is data-driven and always processes chunks of data. Therefore, data are always written in chunks, each chunk being preceded by a text-formatted header indicating the start and end indices, followed by the size (in bytes) of the chunk.
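
The chunked layout can be illustrated as follows. The exact header syntax is not specified here, so this sketch assumes a hypothetical one-line ASCII header of the form "start end size" terminated by a newline; the function names are likewise assumptions.

```python
import io


def write_chunk(stream, start, end, payload):
    """Write one chunk: a text header followed by the binary payload."""
    # Hypothetical header layout: "<start> <end> <size>\n" in ASCII.
    stream.write(f"{start} {end} {len(payload)}\n".encode("ascii"))
    stream.write(payload)


def read_chunk(stream):
    """Read the next chunk, or return None at the end of the stream."""
    line = stream.readline()
    if not line:
        return None
    start, end, size = (int(token) for token in line.split())
    payload = stream.read(size)
    if len(payload) != size:
        raise ValueError("truncated chunk")
    return start, end, payload


# Usage: write two chunks to an in-memory stream and read them back.
buffer = io.BytesIO()
write_chunk(buffer, 0, 2, b"ab")
write_chunk(buffer, 2, 5, b"cde")
buffer.seek(0)
chunks = [read_chunk(buffer), read_chunk(buffer)]
```

Storing the size in the header lets a reader skip over a chunk without decoding its payload.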

In the Python backend of the platform, this binary format has been successfully implemented using the struct module.