Command-Line Interface (CLI)

This package provides a single entry point for all of its applications using Bob’s unified CLI mechanism. A list of available applications can be retrieved using:

$ bob binseg --help
Usage: bob binseg [OPTIONS] COMMAND [ARGS]...

  Binary 2D Image Segmentation Benchmark commands.

Options:
  -h, -?, --help  Show this message and exit.

Commands:
  analyze         Runs a complete evaluation from prediction to comparison
  compare         Compare multiple systems together.
  config          Command for listing, describing and copying...
  dataset         Commands for listing and verifying datasets
  evaluate        Evaluate an FCN on a binary segmentation task.
  experiment      Runs a complete experiment, from training, to...
  mkmask          Commands for generating masks for images in a dataset.
  predict         Predicts vessel map (probabilities) on input images.
  significance    Evaluates how significantly different are two models on...
  train           Trains an FCN to perform binary segmentation.
  train-analysis  Analyze the training logs for loss evolution and...

Setup

The dataset command lists and checks installed (raw) datasets.

$ bob binseg dataset --help
Usage: bob binseg dataset [OPTIONS] COMMAND [ARGS]...

  Commands for listing and verifying datasets

Options:
  -h, -?, --help  Show this message and exit.

Commands:
  check  Checks file access on one or more datasets
  list   Lists all supported and configured datasets

List available datasets

Lists supported and configured raw datasets.

$ bob binseg dataset list --help
Usage: bob binseg dataset list [OPTIONS]

  Lists all supported and configured datasets

Options:
  -v, --verbose   Increase the verbosity level from 0 (only error messages) to
                  1 (warnings), 2 (log messages), 3 (debug information) by
                  adding the --verbose option as often as desired (e.g. '-vvv'
                  for debug).
  -h, -?, --help  Show this message and exit.

  Examples:

      1. To install a dataset, set up its data directory ("datadir").  For
         example, to set up access to DRIVE files you downloaded locally at
         the directory "/path/to/drive/files", do the following:
  
         $ bob config set "bob.ip.binseg.drive.datadir" "/path/to/drive/files"

         Notice this setting **is** case-sensitive.

      2. List all raw datasets supported (and configured):

         $ bob binseg dataset list
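
The -v option shown above is a counter: each repetition raises the verbosity by one level. A minimal Python sketch of that mapping follows. It uses the standard argparse and logging modules rather than the package's own CLI code. The function name verbosity_to_level, the program name "example", and the pairing of "log messages" with logging.INFO are assumptions made for illustration.

```python
import argparse
import logging

# Levels in the order given by the help text: 0 = errors only, 1 = warnings,
# 2 = log messages (assumed to be INFO here), 3 = debug information.
_LEVELS = (logging.ERROR, logging.WARNING, logging.INFO, logging.DEBUG)


def verbosity_to_level(count: int) -> int:
    """Map the number of -v flags to a logging level, clamping out-of-range counts."""
    return _LEVELS[max(0, min(count, len(_LEVELS) - 1))]


# Hypothetical stand-in parser; the real commands define their own options.
parser = argparse.ArgumentParser(prog="example")
parser.add_argument("-v", "--verbose", action="count", default=0)

args = parser.parse_args(["-vv"])  # same as passing -v twice
logging.basicConfig(level=verbosity_to_level(args.verbose))
```

Counts above three are clamped to the debug level in this sketch, so '-vvvv' behaves like '-vvv'.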

Check available datasets

Checks if we can load all files listed for a given dataset (all subsets in all protocols).

$ bob binseg dataset check --help
Usage: bob binseg dataset check [OPTIONS] [DATASET]...

  Checks file access on one or more datasets

Options:
  -l, --limit INTEGER RANGE  Limit check to the first N samples in each
                             dataset, making the check sensibly faster.  Set
                             it to zero to check everything.  [x>=0; required]
  -v, --verbose              Increase the verbosity level from 0 (only error
                             messages) to 1 (warnings), 2 (log messages), 3
                             (debug information) by adding the --verbose
                             option as often as desired (e.g. '-vvv' for
                             debug).
  -h, -?, --help             Show this message and exit.

  Examples:

  1. Check if all files of the DRIVE dataset can be loaded:

     $ bob binseg dataset check -vv drive

  2. Check if all files of multiple installed datasets can be loaded:

     $ bob binseg dataset check -vv drive stare

  3. Check if all files of all installed datasets can be loaded:

     $ bob binseg dataset check
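
The effect of --limit can be illustrated with a small, self-contained Python sketch. It is not the package's implementation, which loads samples through its dataset classes. Here a hypothetical check_access function simply tries to read one byte from each file path, and a limit of zero means "check everything", as in the help text above.

```python
from typing import List, Sequence


def check_access(paths: Sequence[str], limit: int = 0) -> List[str]:
    """Return the paths that could not be opened for reading.

    limit=0 checks every path; limit=N checks only the first N paths.
    """
    if limit < 0:
        raise ValueError("limit must be >= 0")
    selected = paths if limit == 0 else paths[:limit]
    failures = []
    for path in selected:
        try:
            with open(path, "rb") as handle:
                handle.read(1)  # reading one byte proves the file is accessible
        except OSError:
            failures.append(path)
    return failures
```

An empty return value means every checked file was readable. With a non-zero limit, files beyond the first N are never touched, which is what makes the check faster.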

Preset Configuration Resources

The config command lists, describes and copies the configuration resources exported by this package.

$ bob binseg config --help
Usage: bob binseg config [OPTIONS] COMMAND [ARGS]...

  Command for listing, describing and copying configuration resources.

Options:
  -?, -h, --help  Show this message and exit.

Commands:
  copy      Copy a specific configuration resource so it can be modified...
  describe  Describe a specific configuration file.
  list      List configuration files installed.

Listing Resources

$ bob binseg config list --help
Usage: bob binseg config list [OPTIONS]

  List configuration files installed.

Options:
  -v, --verbose   Increase the verbosity level from 0 (only error messages) to
                  1 (warnings), 2 (log messages), 3 (debug information) by
                  adding the --verbose option as often as desired (e.g. '-vvv'
                  for debug).
  -h, -?, --help  Show this message and exit.

  Examples:

    1. Lists all configuration resources (type: bob.ip.binseg.config) installed:

       $ bob binseg config list

    2. Lists all configuration resources and their descriptions (notice this may
       be slow as it needs to load all modules once):

       $ bob binseg config list -v
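
The help text above names the resource type bob.ip.binseg.config. Assuming these resources are registered as standard Python entry points under a group of that name, they could also be enumerated from Python with importlib.metadata, as in the sketch below. The function name list_resources is hypothetical, and the result is simply empty when the package is not installed.

```python
from importlib.metadata import entry_points
from typing import List


def list_resources(group: str = "bob.ip.binseg.config") -> List[str]:
    """Return the sorted names of all entry points registered under `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: entry points are filtered with select()
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: entry_points() returns a dict keyed by group name
        selected = eps.get(group, [])
    return sorted(ep.name for ep in selected)


if __name__ == "__main__":
    for name in list_resources():
        print(name)
```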

Available Resources

Here is a list of all resources currently exported.

$ bob binseg config list -v
module: bob.ip.binseg.configs.datasets
  chasedb1              CHASE-DB1 dataset for Vessel Segmentation (first-anno...
  chasedb1-1024         CHASE-DB1 dataset for Vessel Segmentation
  chasedb1-2nd          CHASE-DB1 dataset for Vessel Segmentation (second-ann...
  chasedb1-768          CHASE-DB1 dataset for Vessel Segmentation
  chasedb1-covd         COVD-CHASEDB1 for Vessel Segmentation
  chasedb1-mtest        CHASE-DB1 cross-evaluation dataset with matched resol...
  chasedb1-xtest        CHASE-DB1 cross-evaluation dataset
  combined-cup          Combining all optic cup dataset together with the sam...
  combined-disc         Combining all optic disc dataset together with the sa...
  combined-vessels      Combining all vessel dataset together with the same r...
  csv-dataset-example   Example CSV-based custom filelist dataset
  cxr8                  CXR8 Dataset (default protocol)
  cxr8-idiap            CXR8 Dataset ("idiap" protocol - just like "default",...
  cxr8-idiap-xtest      CXR8 cross-evaluation dataset with Idiap directory st...
  cxr8-xtest            CXR8 cross-evaluation dataset
  drhagis               DRHAGIS dataset for Vessel Segmentation (default prot...
  drionsdb              DRIONS-DB for Optic Disc Segmentation (expert #1 anno...
  drionsdb-2nd          DRIONS-DB for Optic Disc Segmentation (expert #2 anno...
  drionsdb-2nd-512      DRIONS-DB for Optic Disc Segmentation (expert #2 anno...
  drionsdb-512          DRIONS-DB for Optic Disc Segmentation (expert #1 anno...
  drionsdb-768          DRIONS-DB for Optic Disc Segmentation (expert #1 anno...
  drishtigs1-cup        DRISHTI-GS1 dataset for Cup Segmentation (agreed by a...
  drishtigs1-cup-512    DRISHTI-GS1 dataset for Cup Segmentation (agreed by a...
  drishtigs1-cup-768    DRISHTI-GS1 dataset for Cup Segmentation (agreed by a...
  drishtigs1-cup-any    DRISHTI-GS1 dataset for Cup Segmentation (agreed by a...
  drishtigs1-disc       DRISHTI-GS1 dataset for Optic Disc Segmentation (agre...
  drishtigs1-disc-512   DRISHTI-GS1 dataset for Optic Disc Segmentation (agre...
  drishtigs1-disc-768   DRISHTI-GS1 dataset for Optic Disc Segmentation (agre...
  drishtigs1-disc-any   DRISHTI-GS1 dataset for Optic Disc Segmentation (agre...
  drive                 DRIVE dataset for Vessel Segmentation (default protoc...
  drive-1024            DRIVE dataset for Vessel Segmentation (Resolution use...
  drive-2nd             DRIVE dataset for Vessel Segmentation (second annotat...
  drive-768             DRIVE dataset for Vessel Segmentation (Resolution use...
  drive-covd            COVD-DRIVE for Vessel Segmentation
  drive-mtest           DRIVE cross-evaluation dataset with matched resolutio...
  drive-xtest           DRIVE cross-evaluation dataset
  hrf                   HRF dataset for Vessel Segmentation (default protocol...
  hrf-1024              HRF dataset for Vessel Segmentation
  hrf-768               HRF dataset for Vessel Segmentation
  hrf-covd              COVD-HRF for Vessel Segmentation
  hrf-highres           HRF dataset for Vessel Segmentation (default protocol...
  hrf-mtest             HRF cross-evaluation dataset with matched resolution
  hrf-xtest             HRF cross-evaluation dataset
  iostar-disc           IOSTAR dataset for Optic Disc Segmentation (default p...
  iostar-disc-512       IOSTAR dataset for Optic Disc Segmentation
  iostar-disc-768       IOSTAR dataset for Optic Disc Segmentation
  iostar-vessel         IOSTAR dataset for Vessel Segmentation (default proto...
  iostar-vessel-768     IOSTAR dataset for Vessel Segmentation (default proto...
  iostar-vessel-covd    COVD-IOSTAR for Vessel Segmentation
  iostar-vessel-mtest   IOSTAR vessel cross-evaluation dataset with matched r...
  iostar-vessel-xtest   IOSTAR vessel cross-evaluation dataset
  jsrt                  Japanese Society of Radiological Technology dataset f...
  jsrt-xtest            JSRT CXR cross-evaluation dataset
  montgomery            Montgomery County dataset for Lung Segmentation (defa...
  montgomery-xtest      Montgomery County cross-evaluation dataset
  refuge-cup            REFUGE dataset for Optic Cup Segmentation (default pr...
  refuge-cup-512        REFUGE dataset for Optic Cup Segmentation
  refuge-cup-768        REFUGE dataset for Optic Cup Segmentation
  refuge-disc           REFUGE dataset for Optic Disc Segmentation (default p...
  refuge-disc-512       DRISHTI-GS1 dataset for Optic Disc Segmentation (agre...
  refuge-disc-768       DRISHTI-GS1 dataset for Optic Disc Segmentation (agre...
  rimoner3-cup          RIM-ONE r3 for Optic Cup Segmentation (expert #1 anno...
  rimoner3-cup-2nd      RIM-ONE r3 for Optic Cup Segmentation (expert #2 anno...
  rimoner3-cup-512      RIM-ONE r3 for Optic Cup Segmentation (expert #1 anno...
  rimoner3-cup-768      RIM-ONE r3 for Optic Cup Segmentation (expert #1 anno...
  rimoner3-disc         RIM-ONE r3 for Optic Disc Segmentation (expert #1 ann...
  rimoner3-disc-2nd     RIM-ONE r3 for Optic Disc Segmentation (expert #2 ann...
  rimoner3-disc-512     RIM-ONE r3 for Optic Disc Segmentation (expert #1 ann...
  rimoner3-disc-768     RIM-ONE r3 for Optic Disc Segmentation (expert #1 ann...
  shenzhen              Shenzhen dataset for Lung Segmentation (default proto...
  shenzhen-small        Shenzhen dataset for Lung Segmentation (default proto...
  shenzhen-xtest        Shenzhen cross-evaluation dataset
  stare                 STARE dataset for Vessel Segmentation (annotator AH)
  stare-1024            STARE dataset for Vessel Segmentation (annotator AH)
  stare-2nd             STARE dataset for Vessel Segmentation (annotator VK)
  stare-768             STARE dataset for Vessel Segmentation (annotator AH)
  stare-covd            COVD-STARE for Vessel Segmentation
  stare-mtest           STARE cross-evaluation dataset with matched resolutio...
  stare-xtest           STARE cross-evaluation dataset
module: bob.ip.binseg.configs.models
  driu      DRIU Network for Vessel Segmentation
  driu-bn   DRIU Network for Vessel Segmentation with Batch Normalization
  driu-od   DRIU Network for Optic Disc Segmentation
  hed       HED Network for image segmentation
  lwnet     Little W-Net for image segmentation
  m2unet    MobileNetV2 U-Net model for image segmentation
  resunet   Residual U-Net for image segmentation
  unet      U-Net for image segmentation

Describing a Resource

$ bob binseg config describe --help
Usage: bob binseg config describe [OPTIONS] NAME...

  Describe a specific configuration file.

Options:
  -v, --verbose   Increase the verbosity level from 0 (only error messages) to
                  1 (warnings), 2 (log messages), 3 (debug information) by
                  adding the --verbose option as often as desired (e.g. '-vvv'
                  for debug).
  -?, -h, --help  Show this message and exit.

  Examples:

    1. Describes the DRIVE (training) dataset configuration:

       $ bob binseg config describe drive

    2. Describes the DRIVE (training) dataset configuration and lists its
       contents:

       $ bob binseg config describe drive -v

Copying a Resource

You may use this command to locally copy a resource file so you can change it.

$ bob binseg config copy --help
Usage: bob binseg config copy [OPTIONS] SOURCE DESTINATION

  Copy a specific configuration resource so it can be modified locally.

Options:
  -v, --verbose   Increase the verbosity level from 0 (only error messages) to
                  1 (warnings), 2 (log messages), 3 (debug information) by
                  adding the --verbose option as often as desired (e.g. '-vvv'
                  for debug).
  -?, -h, --help  Show this message and exit.

  Examples:

    1. Makes a copy of one of the stock configuration files locally, so it can be
       adapted:

       $ bob binseg config copy drive -vvv newdataset.py

Running and Analyzing Experiments

These applications run a combined set of steps in one go. They work well with our preset configuration resources.

Running a Full Experiment Cycle

This command runs training, prediction, evaluation and comparison as a single, multi-step application.

$ bob binseg experiment --help
Usage: bob binseg experiment [OPTIONS] [CONFIG]...

  Runs a complete experiment, from training, to prediction and evaluation

          This script is just a wrapper around the individual scripts for
          training, running prediction, evaluating and comparing FCN model
          performance.  It organises the output in a preset way::

                 └─ <output-folder>/
                    ├── model/  #the generated model will be here
                    ├── predictions/  #the prediction outputs for the train/test set
                    ├── overlayed/  #the overlayed outputs for the train/test set
                       ├── predictions/  #predictions overlayed on the input images
                       ├── analysis/  #predictions overlayed on the input images
                       ├              #including analysis of false positives, negatives
                       ├              #and true positives
                       └── second-annotator/  #if set, store overlayed images for the
                                              #second annotator here
                    └── analysis/  #the outputs of the analysis of both train/test sets
                                   #includes second-annotator "measures" as well, if
                                   #configured

          Training is performed for a configurable number of epochs, and
          generates at least a final_model.pth.  It may also generate a
          number of intermediate checkpoints.  Checkpoints are model files
          (.pth files) that are stored during the training and useful to
          resume the procedure in case it stops abruptly.

          N.B.: The tool is designed to prevent analysis bias and allows one
          to provide (potentially multiple) separate subsets for training,
          validation, and evaluation.  Instead of using simple datasets,
          datasets for full experiment running should be dictionaries with
          specific subset names:

          * ``__train__``: dataset used preferentially for training.  It is
            typically the dataset containing data augmentation pipelines.
          * ``__valid__``: dataset used for validation.  It is typically
            disjoint from the training and test sets.  In such a case, we
            checkpoint the model with the lowest loss on the validation set
            as well, throughout all the training, besides the model at the
            end of training.
          * ``train`` (optional): a copy of the ``__train__`` dataset,
            without data augmentation, that will be evaluated alongside
            other sets available
          * ``__valid_extra__``: a list of datasets that are tracked during
            validation, but do not affect checkpointing. If present, an
            extra column with an array containing the loss of each set is
            kept on the training log.
          * ``*``: any other name, not starting with an underscore character
            (``_``), will be considered a test set for evaluation.

          N.B.2: The threshold used for calculating the F1-score on the test
          set, or overlay analysis (false positives, negatives and true
          positives overprinted on the original image) also follows the
          logic above.

  It is possible to pass one or several Python files (or names of
  ``bob.ip.binseg.config`` entry points or module names i.e. import paths) as
  CONFIG arguments to this command line which contain the parameters listed
  below as Python variables. Available entry points are:

  **bob.ip.binseg** entry points are: chasedb1, chasedb1-1024, chasedb1-2nd,
  chasedb1-768, chasedb1-covd, chasedb1-mtest, chasedb1-xtest, combined-cup,
  combined-disc, combined-vessels, csv-dataset-example, cxr8, cxr8-idiap,
  cxr8-idiap-xtest, cxr8-xtest, drhagis, drionsdb, drionsdb-2nd,
  drionsdb-2nd-512, drionsdb-512, drionsdb-768, drishtigs1-cup,
  drishtigs1-cup-512, drishtigs1-cup-768, drishtigs1-cup-any, drishtigs1-disc,
  drishtigs1-disc-512, drishtigs1-disc-768, drishtigs1-disc-any, driu, driu-
  bn, driu-od, drive, drive-1024, drive-2nd, drive-768, drive-covd, drive-
  mtest, drive-xtest, hed, hrf, hrf-1024, hrf-768, hrf-covd, hrf-highres, hrf-
  mtest, hrf-xtest, iostar-disc, iostar-disc-512, iostar-disc-768, iostar-
  vessel, iostar-vessel-768, iostar-vessel-covd, iostar-vessel-mtest, iostar-
  vessel-xtest, jsrt, jsrt-xtest, lwnet, m2unet, montgomery, montgomery-xtest,
  refuge-cup, refuge-cup-512, refuge-cup-768, refuge-disc, refuge-disc-512,
  refuge-disc-768, resunet, rimoner3-cup, rimoner3-cup-2nd, rimoner3-cup-512,
  rimoner3-cup-768, rimoner3-disc, rimoner3-disc-2nd, rimoner3-disc-512,
  rimoner3-disc-768, shenzhen, shenzhen-small, shenzhen-xtest, stare,
  stare-1024, stare-2nd, stare-768, stare-covd, stare-mtest, stare-xtest, unet

  Options passed through the command-line (see below) override the values
  provided in configuration files. You can run this command with
  ``<COMMAND> -H example_config.py`` to create a template config file.

Options:
  -o, --output-folder PATH        Path where to store experiment outputs
                                  (created if does not exist)  [required]
  -m, --model CUSTOM              A torch.nn.Module instance implementing the
                                  network to be trained, and then evaluated
                                  [required]
  -d, --dataset CUSTOM            A dictionary mapping string keys to
                                  torch.utils.data.dataset.Dataset instances
                                  implementing datasets to be used for
                                  training and validating the model, possibly
                                  including all pre-processing pipelines
                                  required or, optionally, a dictionary
                                  mapping string keys to
                                  torch.utils.data.dataset.Dataset instances.
                                  At least one key named ``train`` must be
                                  available.  This dataset will be used for
                                  training the network model.  The dataset
                                  description must include all required pre-
                                  processing, including any data
                                  augmentation.  If a dataset named
                                  ``__train__`` is available, it is used
                                  preferentially for training instead of
                                  ``train``.  If a dataset named ``__valid__``
                                  is available, it is used for model
                                  validation (and automatic check-pointing) at
                                  each epoch.  If a dataset list named
                                  ``__valid_extra__`` is available, then it
                                  will be tracked during the validation
                                  process and its loss output at the training
                                  log as well, in the format of an array
                                  occupying a single column.  All other keys
                                  are considered test datasets and only used
                                  during analysis, to report the final system
                                  performance  [required]
  -S, --second-annotator CUSTOM   A dataset or dictionary, like in --dataset,
                                  with the same sample keys, but with
                                  annotations from a different annotator that
                                  is going to be compared to the one in
                                  --dataset
  --optimizer CUSTOM              A torch.optim.Optimizer that will be used to
                                  train the network  [required]
  --criterion CUSTOM              A loss function to compute the FCN error for
                                  every sample respecting the PyTorch API for
                                  loss functions (see torch.nn.modules.loss)
                                  [required]
  --scheduler CUSTOM              A learning rate scheduler that drives
                                  changes in the learning rate depending on
                                  the FCN state (see torch.optim.lr_scheduler)
                                  [required]
  -b, --batch-size INTEGER RANGE  Number of samples in every batch (this
                                  parameter affects memory requirements for
                                  the network).  If the number of samples in
                                  the batch is larger than the total number of
                                  samples available for training, this value
                                  is truncated.  If this number is smaller,
                                  then batches of the specified size are
                                  created and fed to the network until there
                                  are no more new samples to feed (epoch is
                                  finished).  If the total number of training
                                  samples is not a multiple of the batch-size,
                                  the last batch will be smaller than the
                                  first, unless --drop-incomplete-batch is
                                  set, in which case this batch is not used.
                                  [default: 2; x>=1; required]
  -c, --batch-chunk-count INTEGER RANGE
                                  Number of chunks in every batch (this
                                  parameter affects memory requirements for
                                  the network). The number of samples loaded
                                  for every iteration will be batch-
                                  size/batch-chunk-count. batch-size needs to
                                  be divisible by batch-chunk-count, otherwise
                                  an error will be raised. This parameter is
                                  used to reduce number of samples loaded in
                                  each iteration, in order to reduce the
                                  memory usage in exchange for processing time
                                  (more iterations).  This is especially
                                  interesting when one is running with GPUs
                                  with limited RAM. The default of 1 forces
                                  the whole batch to be processed at once.
                                  Otherwise the batch is broken into batch-
                                  chunk-count pieces, and gradients are
                                  accumulated to complete each batch.
                                  [default: 1; x>=1; required]
  -D, --drop-incomplete-batch / --no-drop-incomplete-batch
                                  If set, then may drop the last batch in an
                                  epoch, in case it is incomplete.  If you set
                                  this option, you should also consider
                                  increasing the total number of epochs of
                                  training, as the total number of training
                                  steps may be reduced  [default: no-drop-
                                  incomplete-batch; required]
  -e, --epochs INTEGER RANGE      Number of epochs (complete training set
                                  passes) to train for. If continuing from a
                                  saved checkpoint, ensure to provide a
                                  greater number of epochs than that saved on
                                  the checkpoint to be loaded.   [default:
                                  1000; x>=1; required]
  -p, --checkpoint-period INTEGER RANGE
                                  Number of epochs after which a checkpoint is
                                  saved. A value of zero will disable check-
                                  pointing. If checkpointing is enabled and
                                  training stops, it is automatically resumed
                                  from the last saved checkpoint if training
                                  is restarted with the same configuration.
                                  [default: 0; x>=0; required]
  -d, --device TEXT               A string indicating the device to use (e.g.
                                  "cpu" or "cuda:0")  [default: cpu; required]
  -s, --seed INTEGER RANGE        Seed to use for the random number generator
                                  [default: 42; x>=0]
  -P, --parallel INTEGER RANGE    Use multiprocessing for data loading and
                                  processing: if set to -1 (default), disables
                                  multiprocessing altogether.  Set to 0 to
                                  enable as many data loading instances as
                                  processing cores as available in the system.
                                  Set to >= 1 to enable that many
                                  multiprocessing instances for data
                                  processing.  [default: -1; x>=-1; required]
  -I, --monitoring-interval FLOAT RANGE
                                  Time between checks for the use of resources
                                  during each training epoch.  An interval of
                                  5 seconds, for example, will lead to CPU and
                                  GPU resources being probed every 5 seconds
                                  during each training epoch. Values
                                  registered in the training logs correspond
                                  to averages (or maxima) observed through
                                  possibly many probes in each epoch.  Notice
                                  that setting a very small value may cause
                                  the probing process to become extremely
                                  busy, potentially biasing the overall
                                  perception of resource usage.  [default:
                                  5.0; x>=0.1; required]
  -O, --overlayed / --no-overlayed
                                  Creates overlayed representations of the
                                  output probability maps, similar to
                                  --overlayed in prediction-mode, except it
                                  includes distinctive colours for true and
                                  false positives and false negatives.  If not
                                  set, or empty then do **NOT** output
                                  overlayed images.  [default: no-overlayed]
  -S, --steps INTEGER             This number is used to define the number of
                                  threshold steps to consider when evaluating
                                  the highest possible F1-score on test data.
                                  [default: 1000; required]
  -L, --plot-limits FLOAT...      If set, this option affects the performance
                                  comparison plots.  It must be a 4-tuple
                                  containing the bounds of the plot for the x
                                  and y axis respectively (format: [x_low,
                                  x_high, y_low, y_high]).  If not set, use
                                  normal bounds ([0, 1, 0, 1]) for the
                                  performance curve.  [default: 0.0, 1.0, 0.0,
                                  1.0]
  -v, --verbose                   Increase the verbosity level from 0 (only
                                  error messages) to 1 (warnings), 2 (log
                                  messages), 3 (debug information) by adding
                                  the --verbose option as often as desired
                                  (e.g. '-vvv' for debug).
  -H, --dump-config FILENAME      Name of the config file to be generated
  -?, -h, --help                  Show this message and exit.

  Examples:

      1. Trains an M2U-Net model (MobileNetV2 backbone) with DRIVE (vessel
         segmentation), on the CPU, for only two epochs, then runs inference and
         evaluation on stock datasets, and reports performance as a table and a figure:

         $ bob binseg experiment -vv m2unet drive --epochs=2
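
Two rules from the help text above lend themselves to a short illustration: how the keys of the --dataset dictionary are interpreted, and how --batch-size, --batch-chunk-count and --drop-incomplete-batch interact. The Python sketch below restates those rules as plain functions. It is an illustration under stated assumptions, not the package's code: the function names classify_subsets, batches_per_epoch and samples_per_iteration are hypothetical, and the dataset values are placeholders where the real tool expects torch datasets.

```python
from typing import Any, Dict, List, Optional


def classify_subsets(dataset: Dict[str, Any]) -> Dict[str, Any]:
    """Sort dataset keys into roles, following the naming rules in the help text."""
    if "__train__" in dataset:
        training = "__train__"  # preferred over "train" when present
    elif "train" in dataset:
        training = "train"
    else:
        raise KeyError("a 'train' (or '__train__') subset is required")
    validation: Optional[str] = "__valid__" if "__valid__" in dataset else None
    has_extra = "__valid_extra__" in dataset
    # every key that does not start with an underscore is evaluated as a test set
    evaluation: List[str] = [k for k in dataset if not k.startswith("_")]
    return {
        "training": training,
        "validation": validation,
        "has_valid_extra": has_extra,
        "evaluation": evaluation,
    }


def batches_per_epoch(n_samples: int, batch_size: int, drop_incomplete: bool = False) -> int:
    """Number of batches fed to the network in one epoch."""
    batch_size = min(batch_size, n_samples)  # oversized batches are truncated
    full, remainder = divmod(n_samples, batch_size)
    if remainder and not drop_incomplete:
        return full + 1  # the last, smaller batch is still used
    return full


def samples_per_iteration(batch_size: int, batch_chunk_count: int = 1) -> int:
    """Samples loaded per iteration when a batch is split into chunks."""
    if batch_size % batch_chunk_count != 0:
        raise ValueError("batch-size must be divisible by batch-chunk-count")
    return batch_size // batch_chunk_count


# Placeholder values stand in for real dataset objects.
roles = classify_subsets({"__train__": 0, "train": 1, "__valid__": 2, "test": 3})
```

With 20 training samples and a batch size of 8, an epoch has two full batches plus a final batch of 4; dropping incomplete batches leaves only the two full ones, which is why the help text suggests increasing the number of epochs in that case.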

Running Complete Experiment Analysis

This command runs prediction, evaluation and comparison as a single, multi-step application.

$ bob binseg analyze --help
Usage: bob binseg analyze [OPTIONS] [CONFIG]...

  Runs a complete evaluation from prediction to comparison

          This script is just a wrapper around the individual scripts for
          running prediction and evaluating FCN models.  It organises the
          output in a preset way::

                 └─ <output-folder>/
                    ├── predictions/  #the prediction outputs for the train/test set
                    ├── overlayed/  #the overlayed outputs for the train/test set
                       ├── predictions/  #predictions overlayed on the input images
                       ├── analysis/  #predictions overlayed on the input images
                       ├              #including analysis of false positives, negatives
                       ├              #and true positives
                       └── second-annotator/  #if set, store overlayed images for the
                                              #second annotator here
                    └── analysis/  #the outputs of the analysis of both train/test sets
                                   #includes second-annotator "measures" as well, if
                                   #configured

          N.B.: The tool is designed to prevent analysis bias and allows one
          to provide separate subsets for training and evaluation.  Instead
          of using simple datasets, datasets for full experiment running
          should be dictionaries with specific subset names:

          * ``__train__``: dataset used preferentially for training.  It is
            typically the dataset containing data augmentation pipelines.
          * ``train`` (optional): a copy of the ``__train__`` dataset,
            without data augmentation, that will be evaluated alongside
            other sets available
          * ``*``: any other name, not starting with an underscore character
            (``_``), will be considered a test set for evaluation.

          N.B.2: The threshold used for calculating the F1-score on the test
          set, or overlay analysis (false positives, negatives and true
          positives overprinted on the original image) also follows the
          logic above.

  It is possible to pass one or several Python files (or names of
  ``bob.ip.binseg.config`` entry points or module names i.e. import paths) as
  CONFIG arguments to this command line which contain the parameters listed
  below as Python variables. Available entry points are:

  **bob.ip.binseg** entry points are: chasedb1, chasedb1-1024, chasedb1-2nd,
  chasedb1-768, chasedb1-covd, chasedb1-mtest, chasedb1-xtest, combined-cup,
  combined-disc, combined-vessels, csv-dataset-example, cxr8, cxr8-idiap,
  cxr8-idiap-xtest, cxr8-xtest, drhagis, drionsdb, drionsdb-2nd,
  drionsdb-2nd-512, drionsdb-512, drionsdb-768, drishtigs1-cup,
  drishtigs1-cup-512, drishtigs1-cup-768, drishtigs1-cup-any, drishtigs1-disc,
  drishtigs1-disc-512, drishtigs1-disc-768, drishtigs1-disc-any, driu, driu-
  bn, driu-od, drive, drive-1024, drive-2nd, drive-768, drive-covd, drive-
  mtest, drive-xtest, hed, hrf, hrf-1024, hrf-768, hrf-covd, hrf-highres, hrf-
  mtest, hrf-xtest, iostar-disc, iostar-disc-512, iostar-disc-768, iostar-
  vessel, iostar-vessel-768, iostar-vessel-covd, iostar-vessel-mtest, iostar-
  vessel-xtest, jsrt, jsrt-xtest, lwnet, m2unet, montgomery, montgomery-xtest,
  refuge-cup, refuge-cup-512, refuge-cup-768, refuge-disc, refuge-disc-512,
  refuge-disc-768, resunet, rimoner3-cup, rimoner3-cup-2nd, rimoner3-cup-512,
  rimoner3-cup-768, rimoner3-disc, rimoner3-disc-2nd, rimoner3-disc-512,
  rimoner3-disc-768, shenzhen, shenzhen-small, shenzhen-xtest, stare,
  stare-1024, stare-2nd, stare-768, stare-covd, stare-mtest, stare-xtest, unet

  Options passed on the command line (see below) override the values set in
  the configuration files provided as arguments. You can run this command with
  ``<COMMAND> -H example_config.py`` to create a template config file.

Options:
  -o, --output-folder PATH        Path where to store experiment outputs
                                  (created if does not exist)  [required]
  -m, --model CUSTOM              A torch.nn.Module instance implementing the
                                  network to be trained, and then evaluated
                                  [required]
  -d, --dataset CUSTOM            A dictionary mapping string keys to
                                  bob.ip.common.data.utils.SampleList2TorchDataset's.
                                  At least one key named 'train' must be
                                  available.  This dataset will be used for
                                  training the network model.  All other
                                  datasets will be used for prediction and
                                  evaluation. Dataset descriptions include all
                                  required pre-processing, including eventual
                                  data augmentation, which may be eventually
                                  excluded for prediction and evaluation
                                  purposes  [required]
  -S, --second-annotator CUSTOM   A dataset or dictionary, like in --dataset,
                                  with the same sample keys, but with
                                  annotations from a different annotator that
                                  is going to be compared to the one in
                                  --dataset
  -b, --batch-size INTEGER RANGE  Number of samples in every batch (this
                                  parameter affects memory requirements for
                                  the network).  If the number of samples in
                                  the batch is larger than the total number of
                                  samples available for training, this value
                                  is truncated.  If this number is smaller,
                                  then batches of the specified size are
                                  created and fed to the network until there
                                  are no more new samples to feed (epoch is
                                  finished).  If the total number of training
                                  samples is not a multiple of the batch-size,
                                  the last batch will be smaller than the
                                  first.  [default: 1; x>=1; required]
  -d, --device TEXT               A string indicating the device to use (e.g.
                                  "cpu" or "cuda:0")  [default: cpu; required]
  -O, --overlayed / --no-overlayed
                                  Creates overlayed representations of the
                                  output probability maps, similar to
                                  --overlayed in prediction-mode, except it
                                  includes distinctive colours for true and
                                  false positives and false negatives.  If not
                                  set, or empty then do **NOT** output
                                  overlayed images.  [default: no-overlayed]
  -w, --weight CUSTOM             Path or URL to pretrained model file (.pth
                                  extension)  [required]
  -S, --steps INTEGER             This number is used to define the number of
                                  threshold steps to consider when evaluating
                                  the highest possible F1-score on test data.
                                  [default: 1000; required]
  -P, --parallel INTEGER RANGE    Use multiprocessing for data processing: if
                                  set to -1 (default), disables
                                  multiprocessing.  Set to 0 to enable as many
                                  data loading instances as processing cores
                                  as available in the system.  Set to >= 1 to
                                  enable that many multiprocessing instances
                                  for data processing.  [default: -1; x>=-1;
                                  required]
  -L, --plot-limits FLOAT...      If set, this option affects the performance
                                  comparison plots.  It must be a 4-tuple
                                  containing the bounds of the plot for the x
                                  and y axis respectively (format: [x_low,
                                  x_high, y_low, y_high]).  If not set, use
                                  normal bounds ([0, 1, 0, 1]) for the
                                  performance curve.  [default: 0.0, 1.0, 0.0,
                                  1.0]
  -v, --verbose                   Increase the verbosity level from 0 (only
                                  error messages) to 1 (warnings), 2 (log
                                  messages), 3 (debug information) by adding
                                  the --verbose option as often as desired
                                  (e.g. '-vvv' for debug).
  -H, --dump-config FILENAME      Name of the config file to be generated
  -h, -?, --help                  Show this message and exit.

  Examples:

      1. Re-evaluates a pre-trained M2U-Net model with DRIVE (vessel
      segmentation), on the CPU, by running inference and evaluation on results
      from its test set:

         $ bob binseg analyze -vv m2unet drive --weight=model.path
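
The subset naming convention described in the help text above can be illustrated with a small sketch. The `split_subsets` helper and the example dictionary are hypothetical: they only mirror the documented rules (`__train__` is preferred over `train` for training, and every name not starting with an underscore is evaluated) and are not part of this package's API.

```python
def split_subsets(dataset):
    """Sort the keys of a dataset dictionary by the documented naming rules.

    Hypothetical helper, written for illustration only.
    """
    # "__train__" is preferred for training; otherwise fall back to "train".
    if "__train__" in dataset:
        training_key = "__train__"
    elif "train" in dataset:
        training_key = "train"
    else:
        raise KeyError("expected a '__train__' or 'train' subset")
    # Every name that does not start with an underscore is evaluated.
    evaluated = [name for name in dataset if not name.startswith("_")]
    return training_key, evaluated


# Placeholder strings stand in for real dataset objects.
example = {"__train__": "augmented", "train": "plain", "test": "held-out"}
training_key, evaluated = split_subsets(example)
print(training_key, evaluated)
```

In this sketch the augmented `__train__` subset is never evaluated, while its plain `train` copy is reported next to the test set.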

Single-Step Applications

These applications allow finer control over the experiment cycle. They work well with our preset configuration resources, and also give you finer control over the input datasets.

Training FCNs

Training creates a new PyTorch model. This model can be used for evaluation tests or for inference.

$ bob binseg train --help
Usage: bob binseg train [OPTIONS] [CONFIG]...

  Trains an FCN to perform binary segmentation.

      Training is performed for a configurable number of epochs, and generates
      at least a final_model.pth.  It may also generate a number of
      intermediate checkpoints.  Checkpoints are model files (.pth files)
      that are stored during the training and useful to resume the procedure
      in case it stops abruptly.

      Tip: In case the model has been trained over a number of epochs, it is
      possible to continue training, by simply relaunching the same command,
      and changing the number of epochs to a number greater than the number
      where the original training session stopped (or the last checkpoint
      was saved).

  It is possible to pass one or several Python files (or names of
  ``bob.ip.binseg.config`` entry points or module names i.e. import paths) as
  CONFIG arguments to this command line which contain the parameters listed
  below as Python variables. Available entry points are:

  **bob.ip.binseg** entry points are: chasedb1, chasedb1-1024, chasedb1-2nd,
  chasedb1-768, chasedb1-covd, chasedb1-mtest, chasedb1-xtest, combined-cup,
  combined-disc, combined-vessels, csv-dataset-example, cxr8, cxr8-idiap,
  cxr8-idiap-xtest, cxr8-xtest, drhagis, drionsdb, drionsdb-2nd,
  drionsdb-2nd-512, drionsdb-512, drionsdb-768, drishtigs1-cup,
  drishtigs1-cup-512, drishtigs1-cup-768, drishtigs1-cup-any, drishtigs1-disc,
  drishtigs1-disc-512, drishtigs1-disc-768, drishtigs1-disc-any, driu, driu-
  bn, driu-od, drive, drive-1024, drive-2nd, drive-768, drive-covd, drive-
  mtest, drive-xtest, hed, hrf, hrf-1024, hrf-768, hrf-covd, hrf-highres, hrf-
  mtest, hrf-xtest, iostar-disc, iostar-disc-512, iostar-disc-768, iostar-
  vessel, iostar-vessel-768, iostar-vessel-covd, iostar-vessel-mtest, iostar-
  vessel-xtest, jsrt, jsrt-xtest, lwnet, m2unet, montgomery, montgomery-xtest,
  refuge-cup, refuge-cup-512, refuge-cup-768, refuge-disc, refuge-disc-512,
  refuge-disc-768, resunet, rimoner3-cup, rimoner3-cup-2nd, rimoner3-cup-512,
  rimoner3-cup-768, rimoner3-disc, rimoner3-disc-2nd, rimoner3-disc-512,
  rimoner3-disc-768, shenzhen, shenzhen-small, shenzhen-xtest, stare,
  stare-1024, stare-2nd, stare-768, stare-covd, stare-mtest, stare-xtest, unet

  Options passed on the command line (see below) override the values set in
  the configuration files provided as arguments. You can run this command with
  ``<COMMAND> -H example_config.py`` to create a template config file.

Options:
  -o, --output-folder PATH        Path where to store the generated model
                                  (created if does not exist)  [required]
  -m, --model CUSTOM              A torch.nn.Module instance implementing the
                                  network to be trained  [required]
  -d, --dataset CUSTOM            A dictionary mapping string keys to
                                  torch.utils.data.dataset.Dataset instances
                                  implementing datasets to be used for
                                  training and validating the model, possibly
                                  including all pre-processing pipelines
                                  required or, optionally, a dictionary
                                  mapping string keys to
                                  torch.utils.data.dataset.Dataset instances.
                                  At least one key named ``train`` must be
                                  available.  This dataset will be used for
                                  training the network model.  The dataset
                                  description must include all required pre-
                                  processing, including eventual data
                                  augmentation.  If a dataset named
                                  ``__train__`` is available, it is used
                                  preferentially for training instead of
                                  ``train``.  If a dataset named ``__valid__``
                                  is available, it is used for model
                                  validation (and automatic check-pointing) at
                                  each epoch.  If a dataset list named
                                  ``__extra_valid__`` is available, then it
                                  will be tracked during the validation
                                  process and its loss output at the training
                                  log as well, in the format of an array
                                  occupying a single column.  All other keys
                                  are considered test datasets and are ignored
                                  during training  [required]
  --optimizer CUSTOM              A torch.optim.Optimizer that will be used to
                                  train the network  [required]
  --criterion CUSTOM              A loss function to compute the FCN error for
                                  every sample respecting the PyTorch API for
                                  loss functions (see torch.nn.modules.loss)
                                  [required]
  --scheduler CUSTOM              A learning rate scheduler that drives
                                  changes in the learning rate depending on
                                  the FCN state (see torch.optim.lr_scheduler)
                                  [required]
  -b, --batch-size INTEGER RANGE  Number of samples in every batch (this
                                  parameter affects memory requirements for
                                  the network).  If the number of samples in
                                  the batch is larger than the total number of
                                  samples available for training, this value
                                  is truncated.  If this number is smaller,
                                  then batches of the specified size are
                                  created and fed to the network until there
                                  are no more new samples to feed (epoch is
                                  finished).  If the total number of training
                                  samples is not a multiple of the batch-size,
                                  the last batch will be smaller than the
                                  first, unless --drop-incomplete-batch is
                                  set, in which case this batch is not used.
                                  [default: 2; x>=1; required]
  -c, --batch-chunk-count INTEGER RANGE
                                  Number of chunks in every batch (this
                                  parameter affects memory requirements for
                                  the network). The number of samples loaded
                                  for every iteration will be batch-
                                  size/batch-chunk-count. batch-size needs to
                                  be divisible by batch-chunk-count, otherwise
                                  an error will be raised. This parameter is
                                  used to reduce number of samples loaded in
                                  each iteration, in order to reduce the
                                  memory usage in exchange for processing time
                                  (more iterations).  This is especially
                                  interesting when one is running with GPUs
                                  with limited RAM. The default of 1 forces
                                  the whole batch to be processed at once.
                                  Otherwise the batch is broken into batch-
                                  chunk-count pieces, and gradients are
                                  accumulated to complete each batch.
                                  [default: 1; x>=1; required]
  -D, --drop-incomplete-batch / --no-drop-incomplete-batch
                                  If set, then may drop the last batch in an
                                  epoch, in case it is incomplete.  If you set
                                  this option, you should also consider
                                  increasing the total number of epochs of
                                  training, as the total number of training
                                  steps may be reduced  [default: no-drop-
                                  incomplete-batch; required]
  -e, --epochs INTEGER RANGE      Number of epochs (complete training set
                                  passes) to train for. If continuing from a
                                  saved checkpoint, ensure to provide a
                                  greater number of epochs than that saved on
                                  the checkpoint to be loaded.   [default:
                                  1000; x>=1; required]
  -p, --checkpoint-period INTEGER RANGE
                                  Number of epochs after which a checkpoint is
                                  saved. A value of zero will disable check-
                                  pointing. If checkpointing is enabled and
                                  training stops, it is automatically resumed
                                  from the last saved checkpoint if training
                                  is restarted with the same configuration.
                                  [default: 0; x>=0; required]
  -d, --device TEXT               A string indicating the device to use (e.g.
                                  "cpu" or "cuda:0")  [default: cpu; required]
  -s, --seed INTEGER RANGE        Seed to use for the random number generator
                                  [default: 42; x>=0]
  -P, --parallel INTEGER RANGE    Use multiprocessing for data loading: if set
                                  to -1 (default), disables multiprocessing
                                  data loading.  Set to 0 to enable as many
                                  data loading instances as processing cores
                                  as available in the system.  Set to >= 1 to
                                  enable that many multiprocessing instances
                                  for data loading.  [default: -1; x>=-1;
                                  required]
  -I, --monitoring-interval FLOAT RANGE
                                  Time between checks for the use of resources
                                  during each training epoch.  An interval of
                                  5 seconds, for example, will lead to CPU and
                                  GPU resources being probed every 5 seconds
                                  during each training epoch. Values
                                  registered in the training logs correspond
                                  to averages (or maxima) observed through
                                  possibly many probes in each epoch.  Notice
                                  that setting a very small value may cause
                                  the probing process to become extremely
                                  busy, potentially biasing the overall
                                  perception of resource usage.  [default:
                                  5.0; x>=0.1; required]
  -v, --verbose                   Increase the verbosity level from 0 (only
                                  error messages) to 1 (warnings), 2 (log
                                  messages), 3 (debug information) by adding
                                  the --verbose option as often as desired
                                  (e.g. '-vvv' for debug).
  -H, --dump-config FILENAME      Name of the config file to be generated
  -h, -?, --help                  Show this message and exit.

  Examples:

      1. Trains a U-Net model (VGG-16 backbone) with DRIVE (vessel segmentation),
         on a GPU (``cuda:0``):

         $ bob binseg train -vv unet drive --batch-size=4 --device="cuda:0"

      2. Trains a HED model with HRF on a GPU (``cuda:0``):

         $ bob binseg train -vv hed hrf --batch-size=8 --device="cuda:0"

      3. Trains a M2U-Net model on the COVD-DRIVE dataset on the CPU:

         $ bob binseg train -vv m2unet drive-covd --batch-size=8
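
The interplay between `--batch-size`, `--batch-chunk-count` and `--drop-incomplete-batch` described above comes down to simple arithmetic. The sketch below is only an illustration of the documented rules: `plan_epoch` is a hypothetical helper, and the sample count of 20 is an arbitrary example value.

```python
def plan_epoch(n_samples, batch_size, batch_chunk_count=1, drop_incomplete=False):
    """Return (samples loaded per iteration, list of batch sizes in one epoch).

    Hypothetical helper, written for illustration only.
    """
    # The help text states that an indivisible combination is an error.
    if batch_size % batch_chunk_count != 0:
        raise ValueError("batch-size must be divisible by batch-chunk-count")
    # Each iteration loads one chunk; gradients accumulate over the chunks.
    samples_per_iteration = batch_size // batch_chunk_count
    full_batches, remainder = divmod(n_samples, batch_size)
    batch_sizes = [batch_size] * full_batches
    # The last, smaller batch is kept unless incomplete batches are dropped.
    if remainder and not drop_incomplete:
        batch_sizes.append(remainder)
    return samples_per_iteration, batch_sizes


print(plan_epoch(20, 8, batch_chunk_count=2))
print(plan_epoch(20, 8, batch_chunk_count=2, drop_incomplete=True))
```

With 20 samples and a batch size of 8, the third batch holds only 4 samples. Dropping it leaves 16 samples per epoch, which is why the help text suggests raising the number of epochs in that case.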

Prediction with FCNs

Inference takes a PyTorch model as input and generates output probabilities as HDF5 files. Each probability map has the same size as its input image and holds, for every pixel, a floating-point value from 0.0 (less probable) to 1.0 (more probable) giving the probability of a vessel at that pixel.

$ bob binseg predict --help
Usage: bob binseg predict [OPTIONS] [CONFIG]...

  Predicts vessel map (probabilities) on input images.

  It is possible to pass one or several Python files (or names of
  ``bob.ip.binseg.config`` entry points or module names i.e. import paths) as
  CONFIG arguments to this command line which contain the parameters listed
  below as Python variables. Available entry points are:

  **bob.ip.binseg** entry points are: chasedb1, chasedb1-1024, chasedb1-2nd,
  chasedb1-768, chasedb1-covd, chasedb1-mtest, chasedb1-xtest, combined-cup,
  combined-disc, combined-vessels, csv-dataset-example, cxr8, cxr8-idiap,
  cxr8-idiap-xtest, cxr8-xtest, drhagis, drionsdb, drionsdb-2nd,
  drionsdb-2nd-512, drionsdb-512, drionsdb-768, drishtigs1-cup,
  drishtigs1-cup-512, drishtigs1-cup-768, drishtigs1-cup-any, drishtigs1-disc,
  drishtigs1-disc-512, drishtigs1-disc-768, drishtigs1-disc-any, driu, driu-
  bn, driu-od, drive, drive-1024, drive-2nd, drive-768, drive-covd, drive-
  mtest, drive-xtest, hed, hrf, hrf-1024, hrf-768, hrf-covd, hrf-highres, hrf-
  mtest, hrf-xtest, iostar-disc, iostar-disc-512, iostar-disc-768, iostar-
  vessel, iostar-vessel-768, iostar-vessel-covd, iostar-vessel-mtest, iostar-
  vessel-xtest, jsrt, jsrt-xtest, lwnet, m2unet, montgomery, montgomery-xtest,
  refuge-cup, refuge-cup-512, refuge-cup-768, refuge-disc, refuge-disc-512,
  refuge-disc-768, resunet, rimoner3-cup, rimoner3-cup-2nd, rimoner3-cup-512,
  rimoner3-cup-768, rimoner3-disc, rimoner3-disc-2nd, rimoner3-disc-512,
  rimoner3-disc-768, shenzhen, shenzhen-small, shenzhen-xtest, stare,
  stare-1024, stare-2nd, stare-768, stare-covd, stare-mtest, stare-xtest, unet

  Options passed on the command line (see below) override the values set in
  the configuration files provided as arguments. You can run this command with
  ``<COMMAND> -H example_config.py`` to create a template config file.

Options:
  -o, --output-folder PATH        Path where to store the predictions (created
                                  if does not exist)  [required]
  -m, --model CUSTOM              A torch.nn.Module instance implementing the
                                  network to be evaluated  [required]
  -d, --dataset CUSTOM            A torch.utils.data.dataset.Dataset instance
                                  implementing a dataset to be used for
                                  running prediction, possibly including all
                                  pre-processing pipelines required or,
                                  optionally, a dictionary mapping string keys
                                  to torch.utils.data.dataset.Dataset
                                  instances.  All keys that do not start with
                                  an underscore (_) will be processed.
                                  [required]
  -b, --batch-size INTEGER RANGE  Number of samples in every batch (this
                                  parameter affects memory requirements for
                                  the network)  [default: 1; x>=1; required]
  -d, --device TEXT               A string indicating the device to use (e.g.
                                  "cpu" or "cuda:0")  [default: cpu; required]
  -w, --weight CUSTOM             Path or URL to pretrained model file (.pth
                                  extension)  [required]
  -O, --overlayed CUSTOM          Creates overlayed representations of the
                                  output probability maps on top of input
                                  images (store results as PNG files).   If
                                  not set, or empty then do **NOT** output
                                  overlayed images.  Otherwise, the parameter
                                  represents the name of a folder where to
                                  store those
  -P, --parallel INTEGER RANGE    Use multiprocessing for data loading: if set
                                  to -1 (default), disables multiprocessing
                                  data loading.  Set to 0 to enable as many
                                  data loading instances as processing cores
                                  as available in the system.  Set to >= 1 to
                                  enable that many multiprocessing instances
                                  for data loading.  [default: -1; x>=-1;
                                  required]
  -v, --verbose                   Increase the verbosity level from 0 (only
                                  error messages) to 1 (warnings), 2 (log
                                  messages), 3 (debug information) by adding
                                  the --verbose option as often as desired
                                  (e.g. '-vvv' for debug).
  -H, --dump-config FILENAME      Name of the config file to be generated
  -?, -h, --help                  Show this message and exit.

  Examples:

      1. Runs prediction on an existing dataset configuration:
  
         $ bob binseg predict -vv m2unet drive --weight=path/to/model_final_epoch.pth --output-folder=path/to/predictions
  
      2. To run prediction on a folder with your own images, you must first
         specify resizing, cropping, etc, so that the image can be correctly
         input to the model.  Failing to do so will likely result in poor
         performance.  To figure out such specifications, you must consult the
         dataset configuration used for **training** the provided model.  Once
         you have figured this out, do the following:
  
         $ bob binseg config copy csv-dataset-example mydataset.py
         # modify "mydataset.py" to include the base path and required transforms
         $ bob binseg predict -vv m2unet mydataset.py --weight=path/to/model_final_epoch.pth --output-folder=path/to/predictions
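
The three cases of the `--parallel` option documented above (-1, 0, and values of 1 or more) can be summarised in a few lines. This is a minimal sketch of the documented semantics, not the package's own code: `resolve_workers` is a hypothetical name, and returning 0 to mean "no extra worker processes" is an assumption made for the illustration.

```python
import multiprocessing


def resolve_workers(parallel):
    """Translate a --parallel value into a number of worker processes.

    Hypothetical helper, written for illustration only.
    """
    if parallel < -1:
        raise ValueError("--parallel must be >= -1")
    if parallel == -1:
        # Multiprocessing disabled: assumed here to mean zero extra workers.
        return 0
    if parallel == 0:
        # As many instances as processing cores available in the system.
        return multiprocessing.cpu_count()
    # Any value >= 1 requests exactly that many instances.
    return parallel


for value in (-1, 0, 4):
    print(value, "->", resolve_workers(value))
```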

FCN Performance Evaluation

Evaluation takes inference results and compares them to the ground truth, generating a series of analysis figures that help you understand model performance.
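
To make the comparison concrete, the sketch below binarises a probability map at a threshold, computes the F1-score against ground truth, and searches a fixed number of evenly spaced thresholds for the best score, in the spirit of the `--threshold` and `--steps` options of this command. It is a simplified illustration only: the function names, the `>=` comparison, the threshold spacing and the toy data are all assumptions, not the package's implementation.

```python
def f1_score(probabilities, truth, threshold):
    """F1-score of a flat list of probabilities against binary ground truth."""
    tp = fp = fn = 0
    for probability, label in zip(probabilities, truth):
        predicted = probability >= threshold  # assumed comparison
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif label:
            fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_threshold(probabilities, truth, steps=1000):
    """Try `steps` evenly spaced thresholds in [0, 1) and keep the best F1."""
    candidates = [step / steps for step in range(steps)]
    return max(candidates, key=lambda t: f1_score(probabilities, truth, t))


# Toy data: five "pixels" with their predicted probability and true label.
probabilities = [0.1, 0.4, 0.35, 0.8, 0.9]
truth = [0, 0, 1, 1, 1]

threshold = best_threshold(probabilities, truth)
print(threshold, f1_score(probabilities, truth, threshold))
```

Picking the threshold on the same data that is then scored gives an optimistic figure. This is why the help text recommends taking the threshold from the training set or a separate validation set before applying it to the test set.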

$ bob binseg evaluate --help
Usage: bob binseg evaluate [OPTIONS] [CONFIG]...

  Evaluate an FCN on a binary segmentation task.

  It is possible to pass one or several Python files (or names of
  ``bob.ip.binseg.config`` entry points or module names i.e. import paths) as
  CONFIG arguments to this command line which contain the parameters listed
  below as Python variables. Available entry points are:

  **bob.ip.binseg** entry points are: chasedb1, chasedb1-1024, chasedb1-2nd,
  chasedb1-768, chasedb1-covd, chasedb1-mtest, chasedb1-xtest, combined-cup,
  combined-disc, combined-vessels, csv-dataset-example, cxr8, cxr8-idiap,
  cxr8-idiap-xtest, cxr8-xtest, drhagis, drionsdb, drionsdb-2nd,
  drionsdb-2nd-512, drionsdb-512, drionsdb-768, drishtigs1-cup,
  drishtigs1-cup-512, drishtigs1-cup-768, drishtigs1-cup-any, drishtigs1-disc,
  drishtigs1-disc-512, drishtigs1-disc-768, drishtigs1-disc-any, driu, driu-
  bn, driu-od, drive, drive-1024, drive-2nd, drive-768, drive-covd, drive-
  mtest, drive-xtest, hed, hrf, hrf-1024, hrf-768, hrf-covd, hrf-highres, hrf-
  mtest, hrf-xtest, iostar-disc, iostar-disc-512, iostar-disc-768, iostar-
  vessel, iostar-vessel-768, iostar-vessel-covd, iostar-vessel-mtest, iostar-
  vessel-xtest, jsrt, jsrt-xtest, lwnet, m2unet, montgomery, montgomery-xtest,
  refuge-cup, refuge-cup-512, refuge-cup-768, refuge-disc, refuge-disc-512,
  refuge-disc-768, resunet, rimoner3-cup, rimoner3-cup-2nd, rimoner3-cup-512,
  rimoner3-cup-768, rimoner3-disc, rimoner3-disc-2nd, rimoner3-disc-512,
  rimoner3-disc-768, shenzhen, shenzhen-small, shenzhen-xtest, stare,
  stare-1024, stare-2nd, stare-768, stare-covd, stare-mtest, stare-xtest, unet

  Options passed on the command line (see below) override the values set in
  the configuration files provided as arguments. You can run this command with
  ``<COMMAND> -H example_config.py`` to create a template config file.

Options:
  -o, --output-folder PATH        Path where to store the analysis result
                                  (created if does not exist)  [required]
  -p, --predictions-folder DIRECTORY
                                  Path where predictions are currently stored
                                  [required]
  -d, --dataset CUSTOM            A torch.utils.data.dataset.Dataset instance
                                  implementing a dataset to be used for
                                  evaluation purposes, possibly including all
                                  pre-processing pipelines required or,
                                  optionally, a dictionary mapping string keys
                                  to torch.utils.data.dataset.Dataset
                                  instances.  All keys that do not start with
                                  an underscore (_) will be processed.
                                  [required]
  -S, --second-annotator CUSTOM   A dataset or dictionary, like in --dataset,
                                  with the same sample keys, but with
                                  annotations from a different annotator that
                                  is going to be compared to the one in
                                  --dataset.  The same rules regarding dataset
                                  naming conventions apply
  -O, --overlayed CUSTOM          Creates overlayed representations of the
                                  output probability maps, similar to
                                  --overlayed in prediction-mode, except it
                                  includes distinctive colours for true and
                                  false positives and false negatives.  If not
                                  set, or empty then do **NOT** output
                                  overlayed images.  Otherwise, the parameter
                                  represents the name of a folder where to
                                  store those
  -t, --threshold CUSTOM          This number is used to define positives and
                                  negatives from probability maps, and report
                                  F1-scores (a priori). It should either come
                                  from the training set or a separate
                                  validation set to avoid biasing the
                                  analysis.  Optionally, if you provide a
                                  multi-set dataset as input, this may also be
                                  the name of an existing set from which the
                                  threshold will be estimated (highest
                                  F1-score) and then applied to the subsequent
                                  sets.  This number is also used to print the
                                  test set F1-score a priori performance
  -S, --steps INTEGER             This number is used to define the number of
                                  threshold steps to consider when evaluating
                                  the highest possible F1-score on test data.
                                  [default: 1000; required]
  -P, --parallel INTEGER RANGE    Use multiprocessing for data processing: if
                                  set to -1 (default), disables
                                  multiprocessing.  Set to 0 to enable as many
                                  data loading instances as processing cores
                                  as available in the system.  Set to >= 1 to
                                  enable that many multiprocessing instances
                                  for data processing.  [default: -1; x>=-1;
                                  required]
  -v, --verbose                   Increase the verbosity level from 0 (only
                                  error messages) to 1 (warnings), 2 (log
                                  messages), 3 (debug information) by adding
                                  the --verbose option as often as desired
                                  (e.g. '-vvv' for debug).
  -H, --dump-config FILENAME      Name of the config file to be generated
  -h, -?, --help                  Show this message and exit.

  Examples:

      1. Runs evaluation on an existing dataset configuration:
  
         $ bob binseg evaluate -vv drive --predictions-folder=path/to/predictions --output-folder=path/to/results
  
      2. To run evaluation on a folder with your own images and annotations, you
         must first specify resizing, cropping, etc, so that the image can be
         correctly input to the model.  Failing to do so will likely result in
         poor performance.  To figure out such specifications, you must consult
         the dataset configuration used for **training** the provided model.
         Once you figured this out, do the following:
  
         $ bob binseg config copy csv-dataset-example mydataset.py
         # modify "mydataset.py" to your liking
         $ bob binseg evaluate -vv mydataset.py --predictions-folder=path/to/predictions --output-folder=path/to/results
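
The a priori threshold selection described under --threshold and --steps can be illustrated with a small, self-contained sketch. This is not the package's implementation: the function names, the toy data and the `p >= threshold` decision rule are assumptions made for illustration. It sweeps evenly spaced thresholds on a validation set, keeps the one with the highest F1-score, and applies that fixed threshold to the test set.

```python
def f1_score(probabilities, labels, threshold):
    """F1-score of binary decisions obtained by thresholding probabilities."""
    tp = fp = fn = 0
    for p, y in zip(probabilities, labels):
        predicted = p >= threshold  # assumed decision rule
        if predicted and y:
            tp += 1
        elif predicted and not y:
            fp += 1
        elif not predicted and y:
            fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_threshold(probabilities, labels, steps=1000):
    """Sweeps `steps` evenly spaced thresholds in [0, 1], returns the best."""
    candidates = [i / (steps - 1) for i in range(steps)]
    return max(candidates, key=lambda t: f1_score(probabilities, labels, t))


# Toy per-pixel data (hypothetical values, for illustration only)
validation_probs = [0.1, 0.4, 0.35, 0.8, 0.9]
validation_labels = [0, 0, 1, 1, 1]
test_probs = [0.2, 0.6, 0.3, 0.7]
test_labels = [0, 1, 0, 1]

# Tune on the validation set, then apply the fixed threshold to the test set
threshold = best_threshold(validation_probs, validation_labels, steps=101)
test_f1 = f1_score(test_probs, test_labels, threshold)
print(f"threshold={threshold:.2f} test F1={test_f1:.3f}")
```

Because the threshold is chosen without looking at the test set, the reported test F1-score is not biased by the selection step.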

Performance Comparison

Performance comparison takes the performance evaluation results and generates combined figures and tables that compare the results of multiple systems.

$ bob binseg compare --help
Usage: bob binseg compare [OPTIONS] [LABEL_PATH]...

  Compare multiple systems together.

Options:
  -f, --output-figure FILE        Path where to write the output figure (any
                                  extension supported by matplotlib is
                                  possible).  If not provided, does not
                                  produce a figure.
  -T, --table-format [asciidoc|double_grid|double_outline|fancy_grid|fancy_outline|github|grid|heavy_grid|heavy_outline|html|jira|latex|latex_booktabs|latex_longtable|latex_raw|mediawiki|mixed_grid|mixed_outline|moinmoin|orgtbl|outline|pipe|plain|presto|pretty|psql|rounded_grid|rounded_outline|rst|simple|simple_grid|simple_outline|textile|tsv|unsafehtml|youtrack]
                                  The format to use for the comparison table
                                  [default: rst; required]
  -u, --output-table FILE         Path where to write the output table. If not
                                  provided, does not write a table to file,
                                  only to stdout.
  -t, --threshold TEXT            This number is used to select which F1-score
                                  to use for representing a system
                                  performance.  If not set, we report the
                                  maximum F1-score in the set, which is
                                  equivalent to threshold selection a
                                  posteriori (biased estimator), unless the
                                  performance file being considered already
                                  was pre-tuned, and contains a
                                  'threshold_a_priori' column which we then
                                  use to pick a threshold for the dataset. You
                                  can override this behaviour by either
                                  setting this value to a floating-point
                                  number in the range [0.0, 1.0], or to a
                                  string, naming one of the systems which will
                                  be used to calculate the threshold leading
                                  to the maximum F1-score and then applied to
                                  all other sets.
  -L, --plot-limits FLOAT...      If set, must be a 4-tuple containing the
                                  bounds of the plot for the x and y axis
                                  respectively (format: [x_low, x_high, y_low,
                                  y_high]).  If not set, use normal bounds
                                  ([0, 1, 0, 1]) for the performance curve.
                                  [default: 0.0, 1.0, 0.0, 1.0]
  -v, --verbose                   Increase the verbosity level from 0 (only
                                  error messages) to 1 (warnings), 2 (log
                                  messages), 3 (debug information) by adding
                                  the --verbose option as often as desired
                                  (e.g. '-vvv' for debug).
  -?, -h, --help                  Show this message and exit.

  Examples:

      1. Compares system A and B, with their own pre-computed measure files:
  
         $ bob binseg compare -vv A path/to/A/train.csv B path/to/B/test.csv
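
The selection rule described under --threshold (explicit value, else a pre-tuned 'threshold_a_priori' marker, else the maximum F1-score) can be sketched as follows. This is a hedged illustration and not the package's code: apart from 'threshold_a_priori', which the help text names, the column names `threshold` and `f1_score`, the boolean encoding of the marker column, and the helper `pick_row` are all assumptions. The case where --threshold names a system is not covered.

```python
import csv
import io


def pick_row(rows, threshold=None):
    """Selects the row that represents a system's performance."""
    if threshold is not None:
        # Explicit threshold: use the row whose threshold is closest to it
        return min(rows, key=lambda r: abs(float(r["threshold"]) - threshold))
    # Pre-tuned file: use the row flagged as the a priori threshold
    flagged = [r for r in rows if r.get("threshold_a_priori") == "True"]
    if flagged:
        return flagged[0]
    # Fallback: maximum F1-score (a posteriori selection, biased estimator)
    return max(rows, key=lambda r: float(r["f1_score"]))


# Hypothetical measure file contents, for illustration only
pretuned_csv = """threshold,f1_score,threshold_a_priori
0.25,0.70,False
0.50,0.80,True
0.75,0.82,False
"""
plain_csv = """threshold,f1_score
0.25,0.70
0.50,0.80
0.75,0.82
"""

pretuned_rows = list(csv.DictReader(io.StringIO(pretuned_csv)))
plain_rows = list(csv.DictReader(io.StringIO(plain_csv)))

a_priori = pick_row(pretuned_rows)               # flagged row
a_posteriori = pick_row(plain_rows)              # best F1-score row
explicit = pick_row(plain_rows, threshold=0.3)   # row nearest to 0.3
```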

Performance Difference Significance

Calculates the statistical significance of the difference between results obtained by two systems on the same dataset.

$ bob binseg significance --help
Usage: bob binseg significance [OPTIONS] [CONFIG]...

  Evaluates how significantly different are two models on the same dataset

      This application calculates the significance of results of two models
      operating on the same dataset, and subject to a priori threshold
      tuning.

  It is possible to pass one or several Python files (or names of
  ``bob.ip.binseg.config`` entry points or module names i.e. import paths) as
  CONFIG arguments to this command line which contain the parameters listed
  below as Python variables. Available entry points are:

  **bob.ip.binseg** entry points are: chasedb1, chasedb1-1024, chasedb1-2nd,
  chasedb1-768, chasedb1-covd, chasedb1-mtest, chasedb1-xtest, combined-cup,
  combined-disc, combined-vessels, csv-dataset-example, cxr8, cxr8-idiap,
  cxr8-idiap-xtest, cxr8-xtest, drhagis, drionsdb, drionsdb-2nd,
  drionsdb-2nd-512, drionsdb-512, drionsdb-768, drishtigs1-cup,
  drishtigs1-cup-512, drishtigs1-cup-768, drishtigs1-cup-any, drishtigs1-disc,
  drishtigs1-disc-512, drishtigs1-disc-768, drishtigs1-disc-any, driu, driu-
  bn, driu-od, drive, drive-1024, drive-2nd, drive-768, drive-covd, drive-
  mtest, drive-xtest, hed, hrf, hrf-1024, hrf-768, hrf-covd, hrf-highres, hrf-
  mtest, hrf-xtest, iostar-disc, iostar-disc-512, iostar-disc-768, iostar-
  vessel, iostar-vessel-768, iostar-vessel-covd, iostar-vessel-mtest, iostar-
  vessel-xtest, jsrt, jsrt-xtest, lwnet, m2unet, montgomery, montgomery-xtest,
  refuge-cup, refuge-cup-512, refuge-cup-768, refuge-disc, refuge-disc-512,
  refuge-disc-768, resunet, rimoner3-cup, rimoner3-cup-2nd, rimoner3-cup-512,
  rimoner3-cup-768, rimoner3-disc, rimoner3-disc-2nd, rimoner3-disc-512,
  rimoner3-disc-768, shenzhen, shenzhen-small, shenzhen-xtest, stare,
  stare-1024, stare-2nd, stare-768, stare-covd, stare-mtest, stare-xtest, unet

  Options passed through the command line (see below) override the values
  provided in configuration files. You can run this command with
  ``<COMMAND> -H example_config.py`` to create a template config file.

Options:
  -n, --names TEXT...             Names of the two systems to compare
                                  [required]
  -p, --predictions DIRECTORY...  Path where predictions of system 2 are
                                  currently stored.  You may also input
                                  predictions from a second-annotator.  This
                                  application will adequately handle it.
                                  [required]
  -d, --dataset CUSTOM            A dictionary mapping string keys to
                                  torch.utils.data.dataset.Dataset instances
                                  [required]
  -t, --threshold TEXT            This number is used to define positives and
                                  negatives from probability maps, and report
                                  F1-scores (a priori). By default, we expect
                                  a set named 'validation' to be available at
                                  the input data. If that is not the case, we
                                  use 'train', if available.  You may provide
                                  the name of another dataset to be used for
                                  threshold tuning otherwise. If not set, or
                                  a string is input, threshold tuning is done
                                  per system, individually.  Optionally, you
                                  may also provide a floating-point number
                                  between [0.0, 1.0] as the threshold to use
                                  for both systems.  [default: validation;
                                  required]
  -e, --evaluate TEXT             Name of the dataset to evaluate  [default:
                                  test; required]
  -S, --steps INTEGER             This number is used to define the number of
                                  threshold steps to consider when evaluating
                                  the highest possible F1-score on train/test
                                  data.  [default: 1000; required]
  -s, --size INTEGER...           This is a tuple with two values indicating
                                  the size of windows to be used for sliding
                                  window analysis.  The values represent
                                  height and width respectively.  [default:
                                  128, 128; required]
  -t, --stride INTEGER...         This is a tuple with two values indicating
                                  the stride of windows to be used for sliding
                                  window analysis.  The values represent
                                  height and width respectively.  [default:
                                  32, 32; required]
  -f, --figure TEXT               The name of a performance figure (e.g.
                                  f1_score, or jaccard) to use when comparing
                                  performances  [default: accuracy; required]
  -o, --output-folder PATH        Path where to store visualizations
  -R, --remove-outliers / --no-remove-outliers
                                  If set, removes outliers from both score
                                  distributions before running statistical
                                  analysis.  Outlier removal follows a 1.5 IQR
                                  range check from the difference in figures
                                  between both systems and assumes most of the
                                  distribution is contained within that range
                                  (like in a normal distribution)  [default:
                                  no-remove-outliers; required]
  -R, --remove-zeros / --no-remove-zeros
                                  If set, removes instances from the
                                  statistical analysis in which both systems
                                  had a performance equal to zero.  [default:
                                  no-remove-zeros; required]
  -x, --parallel INTEGER          Set the number of parallel processes to use
                                  when running using multiprocessing.  A value
                                  of zero uses all reported cores.  [default:
                                  1; required]
  -k, --checkpoint-folder PATH    Path where to store checkpointed versions of
                                  sliding window performances
  -v, --verbose                   Increase the verbosity level from 0 (only
                                  error messages) to 1 (warnings), 2 (log
                                  messages), 3 (debug information) by adding
                                  the --verbose option as often as desired
                                  (e.g. '-vvv' for debug).
  -H, --dump-config FILENAME      Name of the config file to be generated
  -?, -h, --help                  Show this message and exit.

  Examples:

      1. Runs a significance test using as base the calculated predictions of two
         different systems, on the **same** dataset:
  
         $ bob binseg significance -vv drive --names system1 system2 --predictions=path/to/predictions/system-1 path/to/predictions/system-2
  
      2. By default, we use a "validation" dataset if it is available, to infer
         the a priori threshold for the comparison of two systems.  Otherwise,
         you may need to specify the name of a set to be used as validation set
         for choosing a threshold.  The same goes for the set to be used for
         testing the hypothesis - by default we use the "test" dataset if it is
         available, otherwise, specify.
  
         $ bob binseg significance -vv drive --names system1 system2 --predictions=path/to/predictions/system-1 path/to/predictions/system-2 --threshold=train --evaluate=alternate-test
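
The filtering described under --remove-zeros and --remove-outliers can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the package's implementation: the function names, the order in which the two filters are applied, the quartile interpolation method and the toy figures are all hypothetical. Outliers are detected on the per-sample difference between both systems, using the 1.5 IQR rule.

```python
import statistics


def remove_zeros(a, b):
    """Drops samples for which both systems scored exactly zero."""
    kept = [(x, y) for x, y in zip(a, b) if not (x == 0 and y == 0)]
    return [x for x, _ in kept], [y for _, y in kept]


def remove_outliers(a, b):
    """Drops samples whose difference falls outside the 1.5 IQR range."""
    diffs = [x - y for x, y in zip(a, b)]
    # Quartiles of the differences (interpolation method is an assumption)
    q1, _, q3 = statistics.quantiles(diffs, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    kept = [(x, y) for x, y, d in zip(a, b, diffs) if low <= d <= high]
    return [x for x, _ in kept], [y for _, y in kept]


# Hypothetical per-window figures for two systems, for illustration only
system1 = [0.0, 0.80, 0.82, 0.78, 0.81, 0.95]
system2 = [0.0, 0.79, 0.80, 0.77, 0.80, 0.20]

s1, s2 = remove_zeros(system1, system2)   # drops the (0.0, 0.0) pair
s1, s2 = remove_outliers(s1, s2)          # drops the (0.95, 0.20) pair
```

The remaining paired figures would then be passed to the statistical test of choice.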